Bloom Filters. Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network.

Slides:



Advertisements
Similar presentations
Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol Li Fan, Pei Cao and Jussara Almeida University of Wisconsin-Madison Andrei Broder Compaq/DEC.
Advertisements

Reconciling Differences: towards a theory of cloud complexity George Varghese UCSD, visiting at Yahoo! Labs 1.
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California.
Data Structures Using C++ 2E
Cuckoo Filter: Practically Better Than Bloom
Indian Statistical Institute Kolkata
An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.
Lecture 10: Parallel Databases Wednesday, December 1 st, 2010 Dan Suciu -- CSEP544 Fall
Bloom Filters Kira Radinsky Slides based on material from:
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo.
Lecture 11 March 5 Goals: hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen.
Look-up problem IP address did we see the IP address before?
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal.
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
1Bloom Filters Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with.
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 The Mystery of Cooperative Web Caching 2 b b Web caching : is a process implemented by a caching proxy to improve the efficiency of the web. It reduces.
Student Seminar – Fall 2012 A Simple Algorithm for Finding Frequent Elements in Streams and Bags RICHARD M. KARP, SCOTT SHENKER and CHRISTOS H. PAPADIMITRIOU.
1. 2 Problem RT&T is a large phone company, and they want to provide enhanced caller ID capability: –given a phone number, return the caller’s name –phone.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Replication and Distribution CSE 444 Spring 2012 University of Washington.
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
Hash Tables Universal Families of Hash Functions Bloom Filters Wednesday, July 23 rd 1.
Compact Data Structures and Applications Gil Einziger and Roy Friedman Technion, Haifa.
1 Lecture 11: Bloom Filters, Final Review December 7, 2011 Dan Suciu -- CSEP544 Fall 2011.
Hash Table Theory and chaining. Hash table: Theory and chaining Hash Table Formalism for hashing functions Resolving collisions by chaining.
TinyLFU: A Highly Efficient Cache Admission Policy
Computer Networks Seminar Spring 2007 A Power Management Proxy with a New Best-of-N Bloom Filter Design to Reduce False Positives Miguel Jimeno Ken Christensen.
Data Structures Hash Tables. Hashing Tables l Motivation: symbol tables n A compiler uses a symbol table to relate symbols to associated data u Symbols:
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.
Spatial Issues in DBGlobe Dieter Pfoser. Location Parameter in Services Entering the harbor (x,y position)… …triggers information request.
The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.
The Bloom Paradox Ori Rottenstreich Joint work with Isaac Keslassy Technion, Israel.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Hashing. Search Given: Distinct keys k 1, k 2, …, k n and collection T of n records of the form (k 1, I 1 ), (k 2, I 2 ), …, (k n, I n ) where I j is.
Cuckoo Filter: Practically Better Than Bloom Author: Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Publisher: ACM CoNEXT 2014 Presenter:
Bloom Filters An Introduction and Really Most Of It CMSC 491
Author: Heeyeol Yu; Mahapatra, R.; Publisher: IEEE INFOCOM 2008
Data Structures Using C++ 2E
Hash table CSC317 We have elements with key and satellite data
Lower bounds for approximate membership dynamic data structures
The Variable-Increment Counting Bloom Filter
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal
Bloom Filters Burton Bloom (1970) Favorite Data Structure
Bloom Filters Burton Bloom (1970) Favorite Data Structure
Edge computing (1) Content Distribution Networks
Bloom Filters Very fast set membership. Is x in S? False Positive
Hashing.
Hash Tables – 2 Comp 122, Spring 2004.
Network Applications of Bloom Filters: A Survey
Hashing Alexandra Stefan.
CS5112: Algorithms and Data Structures for Applications
Database Design and Programming
Bloom filters From Probability and Computing
Hash Functions for Network Applications (II)
Lecture 1: Bloom Filters
Hash Tables – 2 1.
Presentation transcript:

Bloom Filters

Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4): 485–509 Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7): 422–42 2Dan Suciu -- CSEP544 Fall 2011

Pig Latin Example Continued 3 Users(name, age) Pages(user, url) SELECT Pages.url, count(*) as cnt FROM Users, Pages WHERE Users.age in [18..25] and Users.name = Pages.user GROUP BY Pages.url ORDER DESC cnt SELECT Pages.url, count(*) as cnt FROM Users, Pages WHERE Users.age in [18..25] and Users.name = Pages.user GROUP BY Pages.url ORDER DESC cnt Dan Suciu -- CSEP544 Fall 2011

Example Problem: many Pages, but only a few visited by users with age Pig’s solution: – MAP phase sends all pages to the reducers How can we reduce communication cost ? 4Dan Suciu -- CSEP544 Fall 2011

Hash Maps Let S = {x 1, x 2,..., x n } be a set of elements Let m > n Hash function h : S  {1, 2, …, m} 5 S = {x 1, x 2,..., x n } 12m H=

Hash Map = Dictionary The hash map acts like a dictionary Insert(x, H) = set bit h(x) to 1 – Collisions are possible Member(y, H) = check if bit h(y) is 1 – False positives are possible Delete(y, H) = not supported ! – Extensions possible, see later 6Dan Suciu -- CSEP544 Fall

Example (cont’d) Map-Reduce task 1 – Map task: compute a hash map H of User names, where age in [18..25]. Several Map tasks in parallel. – Reduce task: combine all hash maps using OR. One single reducer suffices. Map-Reduce task 2 – Map tasks 1: map each User to the appropriate region – Map tasks 2: map only Pages where user in H to appropriate region – Reduce task: do the join 7 Why don’t we lose any Pages ?

Analysis Let S = {x 1, x 2,..., x n } Let j = a specific bit in H (1 ≤ j ≤ m) What is the probability that j remains 0 after inserting all n elements from S into H ? Will compute in two steps 8Dan Suciu -- CSEP544 Fall 2011

Analysis Recall |H| = m Let’s insert only x i into H What is the probability that bit j is 0 ? 9Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m Let’s insert only x i into H What is the probability that bit j is 0 ? Answer: p = 1 – 1/m 10Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m, S = {x 1, x 2,..., x n } Let’s insert all elements from S in H What is the probability that bit j remains 0 ? 11Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m, S = {x 1, x 2,..., x n } Let’s insert all elements from S in H What is the probability that bit j remains 0 ? Answer: p = (1 – 1/m) n 12Dan Suciu -- CSEP544 Fall

Probability of False Positives Take a random element y, and check member(y,H) What is the probability that it returns true ? 13Dan Suciu -- CSEP544 Fall

Probability of False Positives Take a random element y, and check member(y,H) What is the probability that it returns true ? Answer: it is the probability that bit h(y) is 1, which is f = 1 – (1 – 1/m) n ≈ 1 – e -n/m 14Dan Suciu -- CSEP544 Fall

Analysis: Example Example: m = 8n, then f ≈ 1 – e -n/m = 1-e -1/8 ≈ 0.11 A 10% false positive rate is rather high… Bloom filters improve that (coming next) Dan Suciu -- CSEP544 Fall

Bloom Filters Introduced by Burton Bloom in 1970 Improve the false positive ratio Idea: use k independent hash functions 16Dan Suciu -- CSEP544 Fall 2011

Bloom Filter = Dictionary Insert(x, H) = set bits h 1 (x),..., h k (x) to 1 – Collisions between x and x’ are possible Member(y, H) = check if bits h 1 (y),..., h k (y) are 1 – False positives are possible Delete(z, H) = not supported ! – Extensions possible, see later 17Dan Suciu -- CSEP544 Fall 2011

Example Bloom Filter k=3 18 Insert(x,H) Member(y,H) y 1 = is not in H (why ?); y 2 may be in H (why ?)

Choosing k Two competing forces: If k = large – Test more bits for member(y,H)  lower false positive rate – More bits in H are 1  higher false positive rate If k = small – More bits in H are 0  lower positive rate – Test fewer bits for member(y,H)  higher rate 19Dan Suciu -- CSEP544 Fall 2011

Analysis Recall |H| = m, #hash functions = k Let’s insert only x i into H What is the probability that bit j is 0 ? 20Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m, #hash functions = k Let’s insert only x i into H What is the probability that bit j is 0 ? Answer: p = (1 – 1/m) k 21Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m, S = {x 1, x 2,..., x n } Let’s insert all elements from S in H What is the probability that bit j remains 0 ? 22Dan Suciu -- CSEP544 Fall

Analysis Recall |H| = m, S = {x 1, x 2,..., x n } Let’s insert all elements from S in H What is the probability that bit j remains 0 ? Answer: p = (1 – 1/m) kn ≈ e -kn/m 23Dan Suciu -- CSEP544 Fall

Probability of False Positives Take a random element y, and check member(y,H) What is the probability that it returns true ? 24Dan Suciu -- CSEP544 Fall 2011

Probability of False Positives Take a random element y, and check member(y,H) What is the probability that it returns true ? Answer: it is the probability that all k bits h 1 (y), …, h k (y) are 1, which is: 25 f = (1-p) k ≈ (1 – e -kn/m ) k

Optimizing k For fixed m, n, choose k to minimize the false positive rate f Denote g = ln(f) = k ln(1 – e -kn/m ) Goal: find k to minimize g 26 k = ln 2 × m /n

Bloom Filter Summary Given n = |S|, m = |H|, choose k = ln 2 × m /n hash functions 27 f = (1-p) k ≈ (½) k =(½) (ln 2)m/n ≈ (0.6185) m/n p ≈ e -kn/m = ½ Probability that some bit j is 1 Expected distribution m/2 bits 1, m/2 bits 0 Probability of false positive

Bloom Filter Summary In practice one sets m = cn, for some constant c – Thus, we use c bits for each element in S – Then f ≈ (0.6185) c = constant Example: m = 8n, then – k = 8(ln 2) = (use 6 hash functions) – f ≈ (0.6185) m/n = (0.6185) 8 ≈ 0.02 (2% false positives) – Compare to a hash table: f ≈ 1 – e -n/m = 1-e -1/8 ≈ 0.11 Dan Suciu -- CSEP544 Fall The reward for increasing m is much higher for Bloom filters

Set Operations Intersection and Union of Sets: Set S  Bloom filter H Set S’  Bloom filter H’ How do we computed the Bloom filter for the intersection of S and S’ ? Dan Suciu -- CSEP544 Fall

Set Operations Intersection and Union: Set S  Bloom filter H Set S’  Bloom filter H’ How do we computed the Bloom filter for the intersection of S and S’ ? Answer: bit-wise AND: H ∧ H’ 30Dan Suciu -- CSEP544 Fall 2011

Counting Bloom Filter Goal: support delete(z, H) Keep a counter for each bit j Insertion  increment counter Deletion  decrement counter Overflow  keep bit 1 forever Using 4 bits per counter: Probability of overflow ≤ × m 31Dan Suciu -- CSEP544 Fall 2011

Application: Dictionaries Bloom originally introduced this for hyphenation 90% of English words can be hyphenated using simple rules 10% require table lookup Use “bloom filter” to check if lookup needed 32Dan Suciu -- CSEP544 Fall 2011

Application: Distributed Caching Web proxies maintain a cache of (URL, page) pairs If a URL is not present in the cache, they would like to check the cache of other proxies in the network Transferring all URLs is expensive ! Instead: compute Bloom filter, exchange periodically 33Dan Suciu -- CSEP544 Fall 2011