1 Hashing, randomness and dictionaries Rasmus Pagh PhD defense October 11, 2002.

Slides:



Advertisements
Similar presentations
Chapter 11. Hash Tables.
Advertisements

Uniform algorithms for deterministic construction of efficient dictionaries Milan Ružić IT University of Copenhagen Faculty of Mathematics University of.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
CPSC 335 Dr. Marina Gavrilova Computer Science University of Calgary Canada.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Cuckoo Filter: Practically Better Than Bloom
Optimal Fast Hashing Yossi Kanizo (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) and David Hay (Hebrew Univ., Israel)
Hashing21 Hashing II: The leftovers. hashing22 Hash functions Choice of hash function can be important factor in reducing the likelihood of collisions.
Randomized Algorithms Lecturer: Moni Naor Cuckoo Hashing.
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
Cuckoo Hashing : Hardware Implementations Adam Kirsch Michael Mitzenmacher.
1 Advanced Database Technology Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Spring 2004 March 4, 2004 INDEXING II Lecture based on [GUW,
Data Structures Hash Tables
CS107 Introduction to Computer Science
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Advanced Algorithms for Massive Datasets Basics of Hashing.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
Hash Tables1 Part E Hash Tables  
CS 206 Introduction to Computer Science II 11 / 12 / 2008 Instructor: Michael Eckmann.
Hashing General idea: Get a large array
CS 206 Introduction to Computer Science II 04 / 06 / 2009 Instructor: Michael Eckmann.
Data Structures Hashing Uri Zwick January 2014.
1. 2 Problem RT&T is a large phone company, and they want to provide enhanced caller ID capability: –given a phone number, return the caller’s name –phone.
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
Fast Set Intersection in Memory Bolin Ding Arnd Christian König UIUC Microsoft Research.
COMPSCI 102 Introduction to Discrete Mathematics.
COSC 2007 Data Structures II
Fast and deterministic hash table lookup using discriminative bloom filters  Author: Kun Huang, Gaogang Xie,  Publisher: 2013 ELSEVIER Journal of Network.
1 Chapter 24 Developing Efficient Algorithms. 2 Executing Time Suppose two algorithms perform the same task such as search (linear search vs. binary search)
Problems and MotivationsOur ResultsTechnical Contributions Membership: Maintain a set S in the universe U with |S| ≤ n. Given an x in U, answer whether.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Trevor Brown – University of Toronto B-slack trees: Space efficient B-trees.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
1 Hash table. 2 Objective To learn: Hash function Linear probing Quadratic probing Chained hash table.
1 Hash table. 2 A basic problem We have to store some records and perform the following:  add new record  delete record  search a record by key Find.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Algorithm Analysis (Algorithm Complexity). Correctness is Not Enough It isn’t sufficient that our algorithms perform the required tasks. We want them.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Can’t provide fast insertion/removal and fast lookup at the same time Vectors, Linked Lists, Stack, Queues, Deques 4 Data Structures - CSCI 102 Copyright.
Lecture 11 Data Structures, Algorithms & Complexity Introduction Dr Kevin Casey BSc, MSc, PhD GRIFFITH COLLEGE DUBLIN.
CS201: Data Structures and Discrete Mathematics I Hash Table.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
David Luebke 1 11/26/2015 Hash Tables. David Luebke 2 11/26/2015 Hash Tables ● Motivation: Dictionaries ■ Set of key/value pairs ■ We care about search,
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Hashing 8 April Example Consider a situation where we want to make a list of records for students currently doing the BSU CS degree, with each.
CS261 Data Structures Hash Tables Open Address Hashing.
CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN C++,
Scalability for Search Scaling means how a system must grow if resources or work grows –Scalability is the ability of a system, network, or process, to.
1 Chapter 9 Searching And Table. 2 OBJECTIVE Introduces: Basic searching concept Type of searching Hash function Collision problems.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Hashing TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA Course: Data Structures Lecturer: Haim Kaplan and Uri Zwick.
Searching Topics Sequential Search Binary Search.
CS 206 Introduction to Computer Science II 04 / 08 / 2009 Instructor: Michael Eckmann.
CSC 413/513: Intro to Algorithms Hash Tables. ● Hash table: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
Chapter 15 Running Time Analysis. Topics Orders of Magnitude and Big-Oh Notation Running Time Analysis of Algorithms –Counting Statements –Evaluating.
Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.
Prof. Amr Goneid, AUC1 CSCI 210 Data Structures and Algorithms Prof. Amr Goneid AUC Part 5. Dictionaries(2): Hash Tables.
Hashing (part 2) CSE 2011 Winter March 2018.
CSCI 210 Data Structures and Algorithms
Scalability for Search
Subject Name: File Structures
Hash functions Open addressing
Randomized Algorithms CS648
Data Structures Introduction
CSE 326: Data Structures Lecture #14
Lecture-Hashing.
Presentation transcript:

1 Hashing, randomness and dictionaries Rasmus Pagh PhD defense October 11, 2002

2 Overview of presentation 1.Introduction to searching a computer’s memory using ”hashing”. 2.New ideas behind some of the results in the thesis. 3.Overview of results in the thesis.

3 PART I PART I Introduction to searching a computer’s memory using ”hashing”.

4 Searching a computer’s memory A basic task in computer science is to search for a certain piece of information (a ”key”) in the memory of a computer. This is often called the dictionary problem. For example, search for information related to ” ”.

5 The sorting approach Keep a sorted list of all information. Number of search steps increases when the amount of information grows Pia Mikael Benno Alice Bengt Holger Børge Petra Jens Pia Mikael Benno Alice Bengt Holger Børge Petra Jens Where to find ? Got it! Petra Holger Bengt

6 The need for speed Many applications perform millions, billions, or even trillions of searches. Users do not want to wait for answers. The amount of data is rapidly increasing – solutions that remain fast for large data set are needed.

7 Idea: Store information in random locations Pia Mikael Benno Alice Bengt Holger Børge Petra Jens Use a ”hash function” to generate and remember random locations. The hashing approach

8 Idea: Store information in random locations Pia Mikael Benno Alice Bengt Holger Børge Petra Jens Use a ”hash function” to generate and remember random locations. The hashing approach Got it! Where to find ?

9 Udi Manber (chief scientist, ): ”The most important techniques behind Yahoo! are: Search time varies, but on average it is great – no matter how much information! hashing, hashing, and hashing”. hashing, hashing, Hashing in the real world hashing, Lots of other critical applications in databases, search engines, algorithms, etc.

10 PART II PART II New ideas behind some of the results in the thesis.

11 Search time guarantee The time for searching is often more critical than the time for updating. Sometimes worst case bounds are important. Pioneering work on dictionaries with worst case search time by Fredman et al. (1982) and Dietzfelbinger et al. (1988). A PROBLEM

Børge Mikael Petra Benno Børge Mikael Petra Benno Cuckoo hashing Idea: The hash function provides two possible locations Holger Bengt Pia Jens Alice Holger Bengt Pia Jens Alice Where to find ? Got it! Not here A NEW SOLUTION

Børge Mikael Petra Benno Børge Mikael Petra Benno New information is inserted by, if necessary, kicking out old information Holger Jens Pia Bengt Alice Holger Jens Pia Bengt Alice Insert ”Harry, ” Cuckooinsertion Cuckoo insertion Pia Børge Harry Insert ”Børge, ” Insert ”Pia, ”

14 Perfect hashing ”Perfect hashing” is hashing without collisions. Some memory to store such a function is necessary. Function description may be stored in fast, expensive memory and the table in cheaper and slower memory. A PROBLEM

15 Another view of cuckoo hashing

16 Hash and displace Where to find ? Idea: Displacements (Tarjan and Yao) A NEW SOLUTION

17 ”Simulating” random functions Quest for practical constructions of hash functions that ”behave sufficiently similar to random functions” initiated by Carter and Wegman in Many constructions suggested, and shown to work well in certain situations. A PROBLEM

18 Uniform hashing Fill hash & displace table with random numbers Twice! A NEW SOLUTION

19 PART III PART III Overview of results in the thesis.

20 Contributions of the thesis  New simple and efficient hashing schemes (cuckoo hashing, hash & displace).  New kinds of hash functions (uniform hashing, dispersing hash functions), with applications.  More efficient deterministic hashing algorithms (static and dynamic).  Detailed investigation of hashing and dictionaries in the ”cell probe” model (upper/lower bounds).  A dictionary using almost minimal space. Mainly theoretical results of mathematical nature, but also some results that may be practically useful.

21 Cuckoo hashing Utilizes nearly half of the hash tables. Searching uses two memory lookups. Insertion takes expected constant time. … when using powerful hash functions. However, very efficient in practice using weaker hash functions. Considerably simpler than other hashing schemes with worst case search time. Joint work with Flemming Friche Rodler

22 Hash and displace For a set of n keys, the analysis needs a table of (2+  )n integers. Table with suitable displacements can be computed in expected O(n) time. The table containing information has size n, i.e., it is completely full. ”Universal” hash functions suffice – fast. Perhaps the most practical such scheme that is theoretically understood

23 Uniform hashing When hashing a set of n keys, one needs tables of O(n) integers. With probability 1-O(n -c ) the new hash function computes independent and uniform values on the set. … when based on powerful hash functions. Previous results required O(n 1+  ) integers. Gives theoretical justification for the widespread ”uniform hashing” assumption Joint work with Anna Östlin

24 Dispersing hash functions Goal: Small use of random bits when hashing. Uniform hashing uses O(n log n) random bits. Universal hashing uses O(log n + log log u) random bits. Dispersing hash functions, introduced in the thesis, may use only O(log log u) random bits. Suffice for, e.g., relational join and element distinctness in expected linear time. No explicit construction – shown to be as hard as finding good explicit extractors.

25 Deterministic dictionary construction Joint paper with Torben Hagerup and Peter Bro Miltersen What if one allows no random bits? We want O(1) query time and linear space. Miltersen (’98) + Hagerup (’99): Reduce in time O(n log n) the general problem to that in a universe of size n 2. Yields time O(n 1+  ). New: An O(n log n) algorithm for computing ”good” displacements for the Tarjan-Yao scheme, which handles universe size n

26 Deterministic dynamic dictionary Known deterministic dynamic dictionaries exhibit a trade-off between update and query time. New trade-off added using the mentioned techniques of Miltersen and Hagerup.

27 Impossibility results There are limits to the efficiency of algorithms. Useful to know if this limit has been reached. Some results from the thesis:  1 adaptive memory probe (as in hash & displace) is optimal for perfect hashing.  2 random access memory probes (as in cuckoo hashing) is worst-case optimal.

28 One-probe search Joint work with Anna Östlin Mikael Benno Holger Jens Holger Benno Mikael Jens Where to find ? ”Hash function” deterministically produces O(log u) possible locations. Probing a random location finds the information with probability 1- . Table size n log u. ”Hash function” not explicit.

29 A succinct dictionary Simple information theory provides a lower bound B on the space usage of a dictionary (say, with no information besides keys). New: A static dictionary with O(1) lookup time using B+o(n)+O(log log u) bits. Improves the lower-order term compared to previous results.

30 Chronology of papers  Low Redundancy in Static Dictionaries with Constant Query Time ICALP 1999 and SIAM Journal on Computing, 2001  Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions WADS 1999  Deterministic Dictionaries With Torben Hagerup and Peter Bro Miltersen SODA and Journal of Algorithms, 2001  A Trade-Off for Worst-Case Efficient Dictionaries SWAT 2000 and Nordic Journal of Computing, 2000  Dispersing Hash Functions RANDOM 2000  On the Cell Probe Complexity of Membership and Perfect Hashing STOC 2001  Cuckoo Hashing With Flemming Friche Rodler ESA 2001  One-Probe Search With Anna Östlin ICALP 2002  Simulating Uniform Hashing in Constant Time and Optimal Space With Anna Östlin Unpublished manuscript, 2002