Dictionary search Exact string search Paper on Cuckoo Hashing.

Presentation transcript:

Dictionary search Exact string search Paper on Cuckoo Hashing

Exact String Search: given a dictionary D of K strings, of total length N, store them so that searches for a pattern P over them are supported efficiently. Approach: hashing.

Hashing with chaining

Key issue: a good hash function. Basic assumption: uniform hashing. Avg #keys per slot = n * (1/m) = n/m = α (load factor)

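Search cost: O(1 + α) expected time per search. With m = Θ(n) the load factor α is a constant, so the expected cost is O(1).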
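As a minimal sketch of hashing with chaining (not taken from the slides), the class below keeps one Python list per slot. The class name `ChainedHashTable` and its methods are hypothetical, and Python's built-in `hash()` stands in for the "good" hash function assumed above.

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list of keys."""

    def __init__(self, m):
        self.m = m                          # number of slots
        self.n = 0                          # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _chain(self, key):
        # Assumption: hash() behaves like a uniform hash function.
        return self.slots[hash(key) % self.m]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Scans one chain only; its average length is n/m.
        return key in self._chain(key)

    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
for word in ["systile", "syzygetic", "syzygial", "syzygy"]:
    table.insert(word)
print(table.search("syzygy"), table.load_factor())
```

Keeping m proportional to n (for instance by doubling the table when the load factor passes a threshold) is what keeps the chains short on average.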

In practice, a trivial hash function is h(k) = k mod m, with m prime.

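A “provably good” hash is $h_a(k) = \left(\sum_{i=0}^{r} a_i k_i\right) \bmod m$, where the key $k$ is split into pieces $k_0, k_1, \dots, k_r$ of $\approx \log_2 m$ bits each (so $r \approx L / \log_2 m$) and each $a_i$ is selected at random in $[0, m)$. Here $L$ = max string len (in bits) and $m$ = table size, a prime. A prime $m$ is not strictly necessary: one can use $((\dots) \bmod p) \bmod m$ with $p$ prime.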
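The construction can be sketched as follows, under some assumptions that are not in the slides: strings are UTF-8 encoded, a sentinel byte is prepended so that strings differing only in leading zero bytes yield different integers, and chunks are taken from the low-order bits upward. The function name `make_universal_hash` and its parameters are hypothetical.

```python
import random


def make_universal_hash(m, max_len_bytes, seed=None):
    """Return h(s) = (sum_i a_i * k_i) mod m for a prime table size m >= 3."""
    rng = random.Random(seed)
    chunk_bits = m.bit_length() - 1            # floor(log2 m): every chunk is < m
    total_bits = 8 * (max_len_bytes + 1)       # +1 byte for the sentinel
    n_chunks = (total_bits + chunk_bits - 1) // chunk_bits
    a = [rng.randrange(m) for _ in range(n_chunks)]   # each a_i random in [0, m)
    mask = (1 << chunk_bits) - 1

    def h(s):
        data = s.encode("utf-8")
        if len(data) > max_len_bytes:
            raise ValueError("string longer than max_len_bytes")
        x = int.from_bytes(b"\x01" + data, "big")     # sentinel keeps the length
        total, i = 0, 0
        while x:
            total += a[i] * (x & mask)                # a_i * k_i
            x >>= chunk_bits
            i += 1
        return total % m

    return h


h = make_universal_hash(m=101, max_len_bytes=16, seed=1)
print(h("syzygy"), h("syzygial"))
```

The random coefficients are drawn once, when the hash function is created, and then reused for every key.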

Cuckoo Hashing: 2 hash tables, and 2 random choices where an item can be stored. [Figure: cells holding keys A, B, C / E, D]

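A running example. [Figure sequence: the cells hold A, B, C / E, D and F is to be inserted; after inserting F they hold A, B, F, C / E, D; G is then to be inserted; after the resulting evictions they hold E, G, B, F, C / A, D]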

Cuckoo Hashing Examples. Random (bipartite) graph: node = cell, edge = key. [Figure: the graph for keys A, B, C, D, E, F, G]
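The eviction process in the running example can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the class name `CuckooHash`, the `max_kicks` bound, and the choice to grow the tables on a failed insertion are all hypothetical, and salting Python's built-in `hash()` with a random seed stands in for the two random hash functions. `None` is used as the empty-cell marker, so it cannot be stored as a key.

```python
import random


class CuckooHash:
    """Two tables; every key has exactly one candidate cell in each table."""

    def __init__(self, size=11, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _pos(self, which, key):
        return hash((self.seeds[which], key)) % self.size

    def contains(self, key):
        # At most two cells are inspected: O(1) worst-case lookup.
        return any(self.tables[t][self._pos(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._pos(which, key)
            evicted = self.tables[which][pos]
            self.tables[which][pos] = key
            if evicted is None:
                return
            key = evicted            # the evicted key moves to its other table
            which = 1 - which
        self._rehash(key)            # too many evictions: rebuild everything

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None]
        keys.append(pending)
        self.size = 2 * self.size + 1
        self.tables = [[None] * self.size, [None] * self.size]
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]
        for k in keys:
            self.insert(k)


c = CuckooHash()
for key in "ABCDEFG":
    c.insert(key)
print(all(c.contains(k) for k in "ABCDEFG"))
```

A long eviction chain corresponds to a long path or a cycle in the graph view above, which is why a bound on the number of kicks followed by a rebuild is a common design choice.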

Natural Extensions. More than 2 hashes (choices) per key: very different analysis (hypergraphs instead of graphs); higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%), but more insert time (and random access). 2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins; insertion sees a tree of possible evict+ins paths; more memory... but more local.

Dictionary search Making one-sided errors Paper on Bloom Filter

Crawling: how to keep track of the URLs visited by a crawler? URLs are long. The check should be very fast. Small errors are tolerable (≈ a page not crawled). Solution: a Bloom Filter over the crawled URLs.

Searching with errors...

Problem: false positives
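A Bloom filter over crawled URLs might look like the sketch below. The class name `BloomFilter` is hypothetical, and deriving the k positions from two halves of a SHA-256 digest (double hashing) is an assumption of this sketch, not something the slides specify. A membership test can return a false positive but never a false negative, which for a crawler means an occasional page is wrongly skipped.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits; every item sets k of them."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True if all k bits are set: "probably seen". False: "definitely not seen".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


seen = BloomFilter(m=8000, k=6)
seen.add("http://example.com/index.html")
print("http://example.com/index.html" in seen)
```

The filter stores m bits in total regardless of how long the URLs are, which is the point of using it for a crawler.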

Not perfectly true but...

We do have an explicit formula for the optimal k: k = (m/n) ln 2. For m/n = 8, opt k ≈ 5.55, so k = 5 or 6 in practice.
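The numbers can be checked with a few lines of Python. The false-positive estimate used here, (1 - e^(-kn/m))^k, is the standard approximation (it treats bit positions as independent), and the function names are hypothetical.

```python
import math


def optimal_k(m, n):
    """Number of hash functions minimizing the false-positive estimate."""
    return (m / n) * math.log(2)


def false_positive_rate(m, n, k):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 8, 1                     # only the ratio m/n = 8 matters
print(round(optimal_k(m, n), 2))
for k in (5, 6):
    print(k, round(false_positive_rate(m, n, k), 4))
```

With 8 bits per stored key, both k = 5 and k = 6 give an estimated false-positive rate of roughly 2%.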

Dictionary search Prefix-string search Reading 3.1 and 5.2

Prefix-string Search: given a dictionary D of K strings, of total length N, store them so that prefix searches for a pattern P over them are supported efficiently.

Trie: speeding-up searches. [Figure: compacted trie over systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo] Pro: O(p) search time, where p is the length of P. Cons: space for edge + node labels and tree structure.
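A prefix search on a trie can be sketched as below. For brevity this is a plain (uncompacted) trie built from nested dictionaries, not the compacted trie of the figure, and the class name `Trie` and the empty-string end marker are choices made for this sketch.

```python
class Trie:
    """Plain trie: one dictionary per node, one character per edge."""

    END = ""   # marker key: a string ends at this node

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie.END] = True

    def with_prefix(self, p):
        # Walk down p characters: O(p) steps to reach the subtree of matches.
        node = self.root
        for ch in p:
            if ch not in node:
                return []
            node = node[ch]
        # Collect every string stored below that node.
        found, stack = [], [(node, p)]
        while stack:
            cur, text = stack.pop()
            for ch, child in cur.items():
                if ch == Trie.END:
                    found.append(text)
                else:
                    stack.append((child, text + ch))
        return sorted(found)


t = Trie()
for w in ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"]:
    t.insert(w)
print(t.with_prefix("syzyg"))
```

Reaching the node for P takes O(p) steps; reporting the matches then costs time proportional to the size of the subtree below it.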

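Front-coding: squeezing strings. Each string is stored as the length of the prefix it shares with the previous string, followed by the remaining suffix. [Figure: a sorted list of URLs and its front-coded form; surviving fragments: Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html 25 Essence_Oils.html 25 Mineral_Bath_Crystals.html 38 Salt.html 33 Cream.html 33; ≈ 45%] Same idea on 0 ….systile syzygetic syzygial syzygy… Gzip may be much better...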
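Front-coding of a sorted string list can be sketched in a few lines. The function names are hypothetical, and the encoded form here is a list of (shared prefix length, suffix) pairs rather than any particular on-disk layout.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    i, limit = 0, min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Replace each string by (prefix length shared with predecessor, suffix)."""
    pairs, prev = [], ""
    for s in sorted_strings:
        k = lcp_len(prev, s)
        pairs.append((k, s[k:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the strings by copying k characters from the previous one."""
    strings, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings


words = ["systile", "syzygetic", "syzygial", "syzygy"]
encoded = front_encode(words)
print(encoded)   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
assert front_decode(encoded) == words
```

Decoding is sequential: to recover one string, every string before it in the same block has to be rebuilt first.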

2-level indexing. Internal memory: a compacted trie (CT) on a sample of the strings (in the figure: systile, szaibelyite). Disk: buckets of front-coded strings (….70systile 92zygetic 85ial 65y 110szaibelyite 82czecin 92omo….). A disadvantage: trade-off ≈ speed vs space (because of bucket size). 2 advantages: search ≈ typically 1 I/O; space ≈ front-coding over buckets.
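The two-level scheme can be sketched as follows, with simplifications that are not in the slides: a sorted Python list searched with `bisect` stands in for the in-memory compacted trie, a list of front-coded buckets stands in for the disk, and all function names are hypothetical. Decoding one bucket plays the role of the single I/O.

```python
from bisect import bisect_right
from os.path import commonprefix


def encode_bucket(strings):
    """Front-code one bucket of sorted strings."""
    pairs, prev = [], ""
    for s in strings:
        k = len(commonprefix([prev, s]))
        pairs.append((k, s[k:]))
        prev = s
    return pairs


def decode_bucket(pairs):
    strings, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings


def build_index(sorted_strings, bucket_size):
    """Return (sample kept in memory, front-coded buckets kept 'on disk')."""
    groups = [sorted_strings[i:i + bucket_size]
              for i in range(0, len(sorted_strings), bucket_size)]
    sample = [g[0] for g in groups]            # first string of every bucket
    return sample, [encode_bucket(g) for g in groups]


def lookup(sample, buckets, key):
    i = bisect_right(sample, key) - 1          # last bucket starting at or before key
    if i < 0:
        return False
    return key in decode_bucket(buckets[i])    # the one "I/O"


words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_index(words, bucket_size=4)
print(sample)   # ['systile', 'szaibelyite']
print(lookup(sample, buckets, "szczecin"))
```

Larger buckets shrink the in-memory sample and compress better, but each lookup then has to decode more strings, which is the speed vs space trade-off noted above.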