Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal.

Slides:



Advertisements
Similar presentations
Xiaoming Sun Tsinghua University David Woodruff MIT
Advertisements

Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan And improvements with Kai-Min Chung.
Many-to-one Trapdoor Functions and their Relations to Public-key Cryptosystems M. Bellare S. Halevi A. Saha S. Vadhan.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.
Cuckoo Filter: Practically Better Than Bloom
Precept 6 Hashing & Partitioning 1 Peng Sun. Server Load Balancing Balance load across servers Normal techniques: Round-robin? 2.
Dana Moshkovitz. Back to NP L  NP iff members have short, efficiently checkable, certificates of membership. Is  satisfiable?  x 1 = truex 11 = true.
Indian Statistical Institute Kolkata
An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.
Introduction to PCP and Hardness of Approximation Dana Moshkovitz Princeton University and The Institute for Advanced Study 1.
Complexity ©D.Moshkovits 1 Hardness of Approximation.
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
Program Verification as Probabilistic Inference Sumit Gulwani Nebojsa Jojic Microsoft Research, Redmond.
Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.
Hit or Miss ? !!!.  Small size.  Simple and fast.  Implementable with hardware.  Does not need too much power.  Does not predict miss if we have.
Bloom Filters Kira Radinsky Slides based on material from:
© 2006 Pearson Addison-Wesley. All rights reserved14 A-1 Chapter 14 Graphs.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Informed Content Delivery Across Adaptive Overlay Networks J. Byers, J. Considine, M. Mitzenmacher and S. Rost Presented by Ananth Rajagopala-Rao.
Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo.
Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.
Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen.
Look-up problem IP address did we see the IP address before?
1Bloom Filters Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Verification Codes Michael Luby, Digital Fountain, Inc. Michael Mitzenmacher Harvard University and Digital Fountain, Inc.
1 Slides by Asaf Shapira & Michael Lewin & Boaz Klartag & Oded Schwartz. Adapted from things beyond us.
Daniel Kroening and Ofer Strichman Decision Procedures An Algorithmic Point of View Deciding ILPs with Branch & Bound ILP References: ‘Integer Programming’
Objective The student will be able to: solve inequalities using multiplication and division.
(c) University of Washingtonhashing-1 CSC 143 Java Hashing Set Implementation via Hashing.
Decision Procedures An Algorithmic Point of View
Random Walks and Markov Chains Nimantha Thushan Baranasuriya Girisha Durrel De Silva Rahul Singhal Karthik Yadati Ziling Zhou.
Chapter 11 Limitations of Algorithm Power. Lower Bounds Lower bound: an estimate on a minimum amount of work needed to solve a given problem Examples:
Ragesh Jaiswal Indian Institute of Technology Delhi Threshold Direct Product Theorems: a survey.
Replication and Distribution CSE 444 Spring 2012 University of Washington.
Hash Tables Universal Families of Hash Functions Bloom Filters Wednesday, July 23 rd 1.
1 Introduction to Quantum Information Processing CS 467 / CS 667 Phys 467 / Phys 767 C&O 481 / C&O 681 Richard Cleve DC 653 Course.
TinyLFU: A Highly Efficient Cache Admission Policy
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.
Conjunctive Filter: Breaking the Entropy Barrier Daisuke Okanohara *1, *2 Yuichi Yoshida *1*3 *1 Preferred Infrastructure Inc. *2 Dept. of Computer Science,
On Adding Bloom Filters to Longest Prefix Matching Algorithms
Interactive proof systems Section 10.4 Giorgi Japaridze Theory of Computability.
The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.
Multi-user Broadcast Authentication in Wireless Sensor Networks Kui Ren, Wenjing Lou, Yanchao Zhang SECON2007 Manar Mahmoud Abou elwafa.
The Bloom Paradox Ori Rottenstreich Joint work with Isaac Keslassy Technion, Israel.
SATMathVideos.Net A set S consists of all multiples of 4. Which of the following sets are contained within set S? A) S2 only B) S4 only C) S2 and S4 D)
Discrete Time Markov Chains
Bloom Filters. Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network.
Cuckoo Filter: Practically Better Than Bloom Author: Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Publisher: ACM CoNEXT 2014 Presenter:
Duplicate Detection in Click Streams(2005) SubtitleAhmed Metwally Divyakant Agrawal Amr El Abbadi Tian Wang.
Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,
MA/CSSE 473 Day 09 Modular Division Revisited Fermat's Little Theorem Primality Testing.
2.5 The Fundamental Theorem of Game Theory For any 2-person zero-sum game there exists a pair (x*,y*) in S  T such that min {x*V. j : j=1,...,n} =
민준기.  an algorithm which employs a degree of randomness as part of its logicalgorithm randomness  The algorithm typically uses uniformly random bits.
Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.
Theory of Computational Complexity Yusuke FURUKAWA Iwama Ito lab M1.
Author: Heeyeol Yu; Mahapatra, R.; Publisher: IEEE INFOCOM 2008
Lower bounds for approximate membership dynamic data structures
The Variable-Increment Counting Bloom Filter
Streaming & sampling.
563.10: Bloom Cookies Web Search Personalization without User Tracking
Homework 3 As announced: not due today 
Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
Y. Kotidis, S. Muthukrishnan,
Network Applications of Bloom Filters: A Survey
By: Ran Ben Basat, Technion, Israel
Bloom filters From Probability and Computing
Presentation transcript:

Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal

Introduction Approximate set membership problem. Trade-off between the space and the false positive probability. Generalize the hashing ideas.

Approximate set membership problem Suppose we have a set S = {s 1,s 2,...,s m }  universe U Represent S in such a way we can quickly answer “ Is x an element of S ? ” To take as little space as possible,we allow false positive (i.e. x  S, but we answer yes ) If x  S, we must answer yes.

Bloom filters Consist of an arrays A[n] of n bits (space), and k independent random hash functions h 1, …,h k : U --> {0,1,..,n-1} 1. Initially set the array to 0 2.  s  S, A[h i (s)] = 1 for 1  i  k (an entry can be set to 1 multiple times, only the first times has an effect ) 3. To check if x  S, we check whether all location A[h i (x)] for 1  i  k are set to 1 If not, clearly x  S. If all A[h i (x)] are set to 1,we assume x  S

Initial with all x1x1 x2x2 Each element of S is hashed k times Each hash location set to y To check if y is in S, check the k hash location. If a 0 appears, y is not in S y If only 1s appear, conclude that y is in S This may yield false positive

The probability of a false positive We assume the hash function are random. After all the elements of S are hashed into the bloom filters,the probability that a specific bit is still 0 is

To simplify the analysis,we can assume a fraction p of the entries are still 0 after all the elements of S are hashed into bloom filters. In fact,let X be the random variable of number of those 0 positions. By Chernoff bound It implies X/n will be very close to p with a very high probability

The probability of a false positive f is To find the optimal k to minimize f. Minimize f iff minimize g=ln(f)  k=ln(2)*(n/m)  f = (1/2) k = ( ) n/m The false positive probability falls exponentially in n/m,the number bits used per item !!

Conclusion A Bloom filters is like a hash table,and simply uses one bit to keep track whether an item hashed to the location. If k=1, it ’ s equivalent to a hashing based fingerprint system. If n=cm for small constant c,such as c=8,then k=5 or 6,the false positive probability is just over 2%. It ’ s interesting that when k is optimal k=ln(2)*(n/m), then p= 1/2. An optimized Bloom filters looks like a random bit-string