Lecture 11: Bloom Filters, Final Review
December 7, 2011
Dan Suciu -- CSEP544 Fall 2011

Lecture on Bloom Filters
Not described in the textbook!
Lecture based in part on:
Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4): 485–509
Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7): 422–426

Example
Users(uid, name, age)
Pages(uid, url)
Compute this query on Map/Reduce:
    SELECT Pages.url, count(*) AS cnt
    FROM Users, Pages
    WHERE Users.age BETWEEN 18 AND 25 AND Users.uid = Pages.uid
    GROUP BY Pages.url
    ORDER BY cnt DESC

Example
Relational algebra plan:
    T1 = SIGMA{age in [18,25]}(Users) JOIN Pages
    Answer = GAMMA{url,count(*)}(T1)
Map/Reduce program: has one MR job for each line above.

Example
Map-Reduce job 1
    Map tasks 1: User where age in [18,25] → (uid, User)
    Map tasks 2: Page → (uid, Page)
    Reduce task: (uid, [User, Page1, Page2, …]) → url1, url2, …
                 (uid, [Page1, Page2, …]) → null
Map-Reduce job 2
    Map task: url → (url, 1)
    Reduce task: (url, [1,1,…]) → (url, count)
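The two jobs above can be imitated in a few lines of Python. This is only a sketch: `run_job` is a hypothetical in-memory stand-in for a Map/Reduce framework, and the sample users and pages are made up for illustration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one Map/Reduce job (not a real framework)."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)      # shuffle: group values by key
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

# Hypothetical sample data; records are tagged so one mapper handles both inputs.
users = [("user", 1, "ann", 20), ("user", 2, "bob", 40), ("user", 3, "cat", 25)]
pages = [("page", 1, "a.com"), ("page", 1, "b.com"),
         ("page", 2, "a.com"), ("page", 3, "a.com")]

def map_join(rec):
    if rec[0] == "user":
        _, uid, _name, age = rec
        if 18 <= age <= 25:                # selection on age
            yield uid, ("user", None)
    else:
        _, uid, url = rec
        yield uid, ("page", url)

def reduce_join(uid, values):
    # Emit the urls only if a qualifying User arrived for this uid.
    if any(tag == "user" for tag, _ in values):
        for tag, url in values:
            if tag == "page":
                yield url

def map_count(url):
    yield url, 1

def reduce_count(url, ones):
    yield url, sum(ones)

t1 = run_job(users + pages, map_join, reduce_join)           # job 1: join
answer = sorted(run_job(t1, map_count, reduce_count),        # job 2: count
                key=lambda kv: -kv[1])
print(answer)
```

Note that every Page record is shuffled to a reducer in job 1, even those whose uid has no qualifying User.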

Example
Problem: many Pages, but only a few visited by users with age 18..25
How can we reduce the number of Pages sent during MR Job 1?

Hash Maps
Let $S = \{x_1, x_2, \ldots, x_n\}$ be a set of elements
Let m > n
Hash function $h : S \to \{1, 2, \ldots, m\}$
[Figure: the elements of S hashed into the bit array H, with positions 1, 2, …, m]

Hash Map = Dictionary
The hash map acts like a dictionary
Insert(x, H) = set bit h(x) to 1
– Collisions are possible
Member(y, H) = check if bit h(y) is 1
– False positives are possible
Delete(y, H) = not supported!
– Extensions possible, see later
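A minimal sketch of this bit-array dictionary follows. The hash function `h` (SHA-256 reduced modulo m) and the class name `HashBitmap` are assumptions made for the example, and bit positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib

def h(x, m):
    """Hypothetical hash function: maps x to a bit position in 0..m-1."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

class HashBitmap:
    def __init__(self, m):
        self.m = m
        self.bits = [0] * m

    def insert(self, x):
        self.bits[h(x, self.m)] = 1        # collisions simply re-set the same bit

    def member(self, y):
        # True for every inserted element; may also be True for others
        # (false positive) when y collides with an inserted element.
        return self.bits[h(y, self.m)] == 1

H = HashBitmap(m=64)
for uid in [3, 17, 42]:
    H.insert(uid)
print([H.member(uid) for uid in [3, 17, 42]])
```

Delete is absent on purpose: clearing bit h(y) could also erase another element that hashed to the same position.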

Example (cont’d)
Map-Reduce job 1a
– Map task: Set of Users → hash map H of User.uid where age in [18,25]
– Reduce task: combine all hash maps using OR. One single reducer suffices
– Note: there is a unique key (say = 1) produced by Map
Map-Reduce job 1b
– Map tasks 1: User where age in [18,25] → (uid, User)
– Map tasks 2: Page where uid in H → (uid, Page)
– Reduce task: do the join
Why don’t we lose any Pages?
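The filtering step can be sketched as follows, under stated assumptions: the bitmap size `M`, the hash `h`, the helper names, and the sample data are all hypothetical, and H is stored as a Python integer so that OR-ing partial bitmaps is a single operator. It also suggests an answer to the closing question: Member never returns false for a uid that was inserted, so every Page that can join survives the filter.

```python
import hashlib

M = 1024  # size of the bit array H (an assumed value)

def h(uid):
    digest = hashlib.sha256(str(uid).encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

def build_bitmap(user_split):
    """Job 1a map task: bitmap of the uids in this split with age in [18, 25]."""
    bits = 0
    for uid, _name, age in user_split:
        if 18 <= age <= 25:
            bits |= 1 << h(uid)
    return bits

def combine(bitmaps):
    """Job 1a reduce task: OR all partial bitmaps together."""
    H = 0
    for b in bitmaps:
        H |= b
    return H

def filter_pages(page_split, H):
    """Job 1b map task 2: forward only the pages whose uid may be in H."""
    return [(uid, url) for uid, url in page_split if (H >> h(uid)) & 1]

user_splits = [[(1, "ann", 20), (2, "bob", 40)], [(3, "cat", 25)]]
pages = [(1, "a.com"), (2, "a.com"), (3, "b.com"), (4, "c.com")]

H = combine(build_bitmap(split) for split in user_splits)
sent = filter_pages(pages, H)
print(sent)
```

Pages of uids 2 and 4 are dropped unless they happen to collide with a qualifying uid, in which case the join in the reduce task still discards them.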

Analysis
Let $S = \{x_1, x_2, \ldots, x_n\}$
Let j = a specific bit in H (1 ≤ j ≤ m)
What is the probability that j remains 0 after inserting all n elements from S into H?
Will compute in two steps

Analysis
Recall |H| = m
Let’s insert only $x_i$ into H
What is the probability that bit j is 0?
Answer: $p = 1 - 1/m$

Analysis
Recall |H| = m, $S = \{x_1, x_2, \ldots, x_n\}$
Let’s insert all elements from S in H
What is the probability that bit j remains 0?
Answer: $p = (1 - 1/m)^n$

Probability of False Positives
Take a random element y, and check member(y,H)
What is the probability that it returns true?
Answer: it is the probability that bit h(y) is 1, which is $f = 1 - (1 - 1/m)^n \approx 1 - e^{-n/m}$

Analysis: Example
Example: m = 8n, then $f \approx 1 - e^{-n/m} = 1 - e^{-1/8} \approx 0.12$
A false positive rate of about 12% is rather high…
Bloom filters improve that (coming next)
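As a quick numeric check of this example, the exact expression and its exponential approximation can be evaluated side by side. The function name `single_hash_fp` and the choice n = 1000 are assumptions for illustration; only the ratio m/n = 8 comes from the slide.

```python
import math

def single_hash_fp(n, m):
    """False positive probability with one hash function: exact and approximate."""
    exact = 1 - (1 - 1 / m) ** n
    approx = 1 - math.exp(-n / m)
    return exact, approx

exact, approx = single_hash_fp(n=1000, m=8000)
print(f"exact={exact:.4f}  approx={approx:.4f}")
```

Both values come out near 0.1175, so the approximation is tight once m is in the thousands.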

Bloom Filters
Introduced by Burton Bloom in 1970
Improve the false positive ratio
Idea: use k independent hash functions

Bloom Filter = Dictionary
Insert(x, H) = set bits $h_1(x), \ldots, h_k(x)$ to 1
– Collisions between x and x’ are possible
Member(y, H) = check if $h_1(y), \ldots, h_k(y)$ are 1
– False positives are possible
Delete(z, H) = not supported!
– Extensions possible, see later
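The two operations can be sketched in Python as below. This is an illustration under assumptions, not a tuned implementation: the k hash functions are simulated by salting SHA-256 with the index i, the class name and the parameters m = 80, k = 6 are chosen for the example, and positions run from 0 to m-1.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)           # one byte per bit, for readability

    def _positions(self, x):
        # Assumption: h_1..h_k are simulated by salting one hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, x):
        for j in self._positions(x):
            self.bits[j] = 1

    def member(self, y):
        # True only if all k bits are set; never False for an inserted element.
        return all(self.bits[j] for j in self._positions(y))

bf = BloomFilter(m=80, k=6)
for word in ["apple", "banana", "cherry"]:
    bf.insert(word)
print([bf.member(w) for w in ["apple", "banana", "cherry"]])
```

A lookup of an element that was never inserted returns True only when all k of its positions were set by other insertions, which is the false positive case.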

Example Bloom Filter, k = 3
[Figure: Insert(x, H) and Member(y, H) illustrated on the bit array H]
$y_1$ is not in H (why?); $y_2$ may be in H (why?)

Choosing k
Two competing forces:
If k = large
– Test more bits for member(y,H) → lower false positive rate
– More bits in H are 1 → higher false positive rate
If k = small
– More bits in H are 0 → lower false positive rate
– Test fewer bits for member(y,H) → higher false positive rate

Analysis
Recall |H| = m, #hash functions = k
Let’s insert only $x_i$ into H
What is the probability that bit j is 0?
Answer: $p = (1 - 1/m)^k$

Analysis
Recall |H| = m, $S = \{x_1, x_2, \ldots, x_n\}$
Let’s insert all elements from S in H
What is the probability that bit j remains 0?
Answer: $p = (1 - 1/m)^{kn} \approx e^{-kn/m}$

Probability of False Positives
Take a random element y, and check member(y,H)
What is the probability that it returns true?
Answer: it is the probability that all k bits $h_1(y), \ldots, h_k(y)$ are 1, which is:
$f = (1 - p)^k \approx (1 - e^{-kn/m})^k$
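To see the two competing forces in numbers, the approximate formula can be evaluated for a range of k. The function name `bloom_fp` and the sizes n = 1000, m = 8000 are assumptions picked for the illustration.

```python
import math

def bloom_fp(n, m, k):
    """Approximate false positive rate f = (1 - e^{-kn/m})^k."""
    p_zero = math.exp(-k * n / m)          # probability that a given bit is 0
    return (1 - p_zero) ** k

n, m = 1000, 8000
rates = {k: bloom_fp(n, m, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print("best k:", best_k)
```

The rate falls as k grows from 1, bottoms out around k = 5 or 6, and then rises again as the bit array fills up.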

Optimizing k
For fixed m, n, choose k to minimize the false positive rate f
Denote $g = \ln(f) = k \ln(1 - e^{-kn/m})$
Goal: find k to minimize g
$k = (\ln 2)\, m/n$
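The step from g to the optimal k is not spelled out on the slide. One standard way to fill it in is sketched here, using the substitution $p = e^{-kn/m}$ (so that $kn/m = -\ln p$), which is introduced only for this derivation.

```latex
\begin{align*}
g(k) &= k \ln\!\left(1 - e^{-kn/m}\right) \\
\frac{dg}{dk} &= \ln\!\left(1 - e^{-kn/m}\right)
               + k \cdot \frac{(n/m)\, e^{-kn/m}}{1 - e^{-kn/m}} \\
              &= \ln(1 - p) - \frac{p \ln p}{1 - p} \\
\frac{dg}{dk} = 0 &\iff (1 - p)\ln(1 - p) = p \ln p
\end{align*}
```

The last equation is symmetric in p and 1 - p, so p = 1/2 satisfies it. Then $e^{-kn/m} = 1/2$, which gives $k = (\ln 2)\, m/n$, and this stationary point can be checked to be the minimum.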

Bloom Filter Summary
Given n = |S|, m = |H|, choose $k = (\ln 2)\, m/n$ hash functions
Probability that a given bit j is 0: $p \approx e^{-kn/m} = 1/2$, so bit j is also 1 with probability 1/2
Expected distribution: m/2 bits 1, m/2 bits 0
Probability of false positive: $f = (1 - p)^k \approx (1/2)^k = (1/2)^{(\ln 2)\, m/n} \approx (0.6185)^{m/n}$

Bloom Filter Summary
In practice one sets m = cn, for some constant c
– Thus, we use c bits for each element in S
– Then $f \approx (0.6185)^c$ = constant
Example: m = 8n, then
– $k = 8 \ln 2 \approx 5.545$ (use 6 hash functions)
– $f \approx (0.6185)^{m/n} = (0.6185)^8 \approx 0.02$ (2% false positives)
– Compare to a hash table: $f \approx 1 - e^{-n/m} = 1 - e^{-1/8} \approx 0.12$
The reward for increasing m is much higher for Bloom filters
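The closing claim can be illustrated by tabulating both false positive rates for a few values of c = m/n. The function names and the particular values of c are assumptions for this sketch; the formulas are the ones on the summary slides.

```python
import math

def optimal_k(c):
    """Optimal (real-valued) number of hash functions for m = c*n."""
    return math.log(2) * c

def bloom_fp_at_optimum(c):
    """f = (1/2)^k with k = (ln 2) c, i.e. roughly 0.6185^c."""
    return 0.5 ** optimal_k(c)

def single_hash_fp(c):
    """f = 1 - e^{-n/m} for the one-hash bit array."""
    return 1 - math.exp(-1 / c)

for c in (4, 8, 16, 32):
    print(f"c={c:2d}  k*={optimal_k(c):5.2f}  "
          f"bloom={bloom_fp_at_optimum(c):.2e}  single-hash={single_hash_fp(c):.3f}")
```

Doubling c roughly halves the single-hash rate, while it squares the Bloom filter rate.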

Set Operations
Intersection and Union of Sets:
Set S → Bloom filter H
Set S’ → Bloom filter H’
How do we compute the Bloom filter for the intersection of S and S’?
Answer: bit-wise AND: H ∧ H’
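A small sketch of both operations, assuming the two filters share the same size and the same hash functions (the values `M`, `K`, the salted SHA-256 hashing, and the sample sets are all assumptions). Filters are stored as Python integers so that AND and OR are single operators. The union filter obtained with OR equals the filter built directly from the union; the AND result contains every bit of the filter built from the intersection, and possibly a few extra ones.

```python
import hashlib

M, K = 256, 4  # assumed filter size and number of hash functions

def positions(x):
    for i in range(K):
        d = hashlib.sha256(f"{i}:{x}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % M

def make_filter(items):
    bits = 0
    for x in items:
        for j in positions(x):
            bits |= 1 << j
    return bits

def member(x, bits):
    return all((bits >> j) & 1 for j in positions(x))

S1 = {"a", "b", "c"}
S2 = {"b", "c", "d"}
H1, H2 = make_filter(S1), make_filter(S2)

H_union = H1 | H2      # bit-wise OR
H_inter = H1 & H2      # bit-wise AND

print(H_union == make_filter(S1 | S2))
print(all(member(x, H_inter) for x in S1 & S2))
```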

Counting Bloom Filter
Goal: support delete(z, H)
Keep a counter for each bit j
Insertion → increment counter
Deletion → decrement counter
Overflow → keep bit 1 forever
Using 4 bits per counter: Probability of overflow ≤ $1.37 \times 10^{-15} \times m$
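One possible reading of these rules in Python is sketched below. The class name, the salted SHA-256 hashing, and the parameters are assumptions; counters saturate at 15 (the largest 4-bit value) and a saturated counter is never decremented, which is how "keep bit 1 forever" is interpreted here.

```python
import hashlib

class CountingBloomFilter:
    MAX = 15  # largest value of a 4-bit counter

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, x):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def insert(self, x):
        for j in self._positions(x):
            if self.counters[j] < self.MAX:
                self.counters[j] += 1      # saturate at MAX instead of wrapping

    def delete(self, z):
        # Only meaningful for elements that were inserted earlier.
        for j in self._positions(z):
            if 0 < self.counters[j] < self.MAX:
                self.counters[j] -= 1      # a saturated counter stays set

    def member(self, y):
        return all(self.counters[j] > 0 for j in self._positions(y))

cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("x")
print(cbf.member("x"))
cbf.delete("x")
print(cbf.member("x"))
```

Deleting an element that was never inserted would corrupt the counters, so callers are assumed to delete only what they inserted.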

Application: Dictionaries
Bloom originally introduced this for hyphenation
90% of English words can be hyphenated using simple rules
10% require table lookup
Use a Bloom filter to check if a lookup is needed

Application: Distributed Caching
Web proxies maintain a cache of (URL, page) pairs
If a URL is not present in the cache, they would like to check the cache of other proxies in the network
Transferring all URLs is expensive!
Instead: compute Bloom filter, exchange periodically

Final Review

The Final
Take-home final; Webquiz
Posted: Thursday, Dec. 8, 2011, 8pm
Closed: Saturday, Dec. 10, 10pm

The Final
No software required (no postgres, no nothing)
Open books, open notes
No communication with your colleagues about the final

The Final
You will receive an email on Thursday night with two things:
– A url with the online version of the final
– A pdf file with your final (for your convenience, so you can print it)
Answer the quiz online
When done: submit, receive confirmation code

The Final
Content (tentative)
1. SQL (including views, constraints, datalog, relational calculus)
2. Conceptual design (FDs, BCNF)
3. Transactions
4. Indexes
5. Query execution and optimization
6. Statistics
7. Parallel query processing
8. Bloom filters

Final Comments
All the information you need to solve the final can be found in the lecture notes
No need to execute the SQL queries
To write a Relational Algebra Plan, use the notation on slide 4
You may save, and continue
Questions? Send me an email, but note that I will be offline all day on Friday