
1
Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Michael Mitzenmacher, Salil Vadhan
And improvements with Kai-Min Chung

2
The Question
Traditional analyses of hashing-based algorithms & data structures assume a truly random hash function.
In practice: simple (e.g. universal) hash functions perform just as well. Why?

3
Outline
– Three hashing applications
– The new model and results
– Proof ideas

4
Bloom Filters
To approximately store S = {x_1, …, x_T} ⊆ [N]:
– Start with an array of M = O(T) zeroes.
– Hash each item k = O(1) times to [M] using h : [N] → [M]^k, putting a one in each location.
To test whether y ∈ S: hash y and accept if there are ones in all k locations.
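A minimal Python sketch of the scheme above. Deriving the k hash functions from seeded SHA-256 is an assumption for this demo; the analysis on the next slide assumes truly random functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an array of m bits, k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k      # m array slots, k hash functions
        self.bits = [0] * m

    def _locations(self, item):
        # Simulate k hash functions by hashing (seed, item) with SHA-256.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1

    def __contains__(self, item):
        # Accept iff all k locations hold a one; false positives are
        # possible, false negatives are not.
        return all(self.bits[loc] for loc in self._locations(item))

bf = BloomFilter(m=1024, k=7)
for x in ["alice", "bob", "carol"]:
    bf.add(x)
print("alice" in bf)  # True: inserted items are always accepted
```

Lookups probe the same k locations as insertions, so a stored item can never be rejected; errors are one-sided.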

5
Bloom Filter Analysis
Thm [B70]: ∀S, ∀y ∉ S, if h is a truly random hash function, then
Pr_h[accept y] = 2^(−(ln 2)·M/T) + o(1)
for an optimal choice of k.
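The optimal choice referred to in the theorem is k ≈ (ln 2)·M/T, which balances the density of ones in the array against the number of probes. A quick numeric check, with M/T = 8 bits per stored item as an assumed example:

```python
import math

M, T = 8 * 10**6, 10**6   # assumed example: M/T = 8 bits per stored item
k_opt = math.log(2) * M / T          # optimal number of hash functions
fp = 2 ** (-math.log(2) * M / T)     # false-positive rate from the theorem

print(round(k_opt, 2))  # 5.55: round to 5 or 6 hash functions in practice
print(round(fp, 4))     # 0.0214
```

With 8 bits per item the ideal false-positive rate is about 2%, and it halves for each additional (ln 2)^(-1) ≈ 1.44 bits per item.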

6
Balanced Allocations
Hashing T items into T buckets:
– What is the maximum number of items (the load) in any bucket?
– Assume buckets are chosen independently & uniformly at random.
Well-known result: Θ(log T / log log T) maximum load w.h.p.

7
Power of Two Choices
Suppose each ball can pick two bins independently and uniformly at random and goes into the bin with the lesser load.
Thm [ABKU94]: maximum load is log log n / log 2 + Θ(1) w.h.p.
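A small simulation makes the gap between one and two choices vivid; the parameters and seed below are arbitrary choices for the demo.

```python
import random

def max_load(n, choices, seed=0):
    """Throw n balls into n bins; each ball samples `choices` bins
    uniformly at random and goes into the least-loaded of them."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        picks = [rng.randrange(n) for _ in range(choices)]
        best = min(picks, key=lambda b: load[b])
        load[best] += 1
    return max(load)

n = 100_000
print("one choice: ", max_load(n, 1))  # grows like log n / log log n
print("two choices:", max_load(n, 2))  # grows like log log n: much smaller
```

The second line stays around 4 even as n grows by orders of magnitude, which is the "power of two choices" phenomenon.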

8
Linear Probing
Hash elements into an array of length M. If slot h(x) is already full, try h(x)+1, h(x)+2, … until an empty slot is found, and place x there.
Thm [K63]: expected insertion time for the T-th item is 1/(1 − T/M)^2 + o(1).
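The probe sequence above can be sketched directly; the identity hash and table size below are assumptions chosen so that every key collides, showing probe counts growing as the run fills.

```python
def insert(table, h, x):
    """Linear probing: try h(x), h(x)+1, ... (mod M) until an empty
    slot is found; returns the number of probes used."""
    M = len(table)
    i = h(x) % M
    probes = 1
    while table[i] is not None:
        i = (i + 1) % M
        probes += 1
    table[i] = x
    return probes

# Toy usage with an assumed hash h(x) = x mod M, where all keys collide:
table = [None] * 8
for x in [0, 8, 16]:                       # all hash to slot 0
    print(insert(table, lambda v: v, x))   # probes: 1, then 2, then 3
```

Clustering is what drives the theorem's bound: each collision extends the occupied run, making later insertions into the same region even slower.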

9
Explicit Hash Functions
Performance can sometimes be analyzed for explicit (e.g. universal [CW79]) hash functions, but the guarantees are somewhat worse, and/or the hash functions are complex/inefficient.
It has been noted since the 1970s that simple hash functions match idealized performance on real data.

10
Simple Hash Functions Don't Always Work
– ∃ pairwise-independent hash families & inputs s.t. linear probing has Ω(log T) insertion time [PPR07].
– ∃ k-wise independent hash families & inputs s.t. the Bloom filter error probability is higher than ideal [MV08].
– Open for balanced allocations.
So the worst case does not match practice.

11
Average-Case Analysis?
Data uniform & independent in [N]:
– Not a good model for real data.
– Trivializes hashing.
We need an intermediate model between worst-case and average-case analysis.

12
Our Model: Block Sources [CG85]
Data is a finite stream, modeled by a sequence of random variables X_1, X_2, …, X_T ∈ [N].
Each stream element has some k bits of (Rényi) entropy, conditioned on previous elements:
H_2(X_i | X_1 = x_1, …, X_{i−1} = x_{i−1}) ≥ k, where H_2(X) = log(1/cp(X)) and cp(X) = Σ_x Pr[X = x]^2.
Similar in spirit to semi-random graphs [BS95] and smoothed analysis [ST01].
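The collision probability cp(X) and the Rényi entropy H_2(X) are easy to compute for explicit distributions; a small sketch (the example distributions are assumptions for illustration):

```python
import math

def collision_prob(p):
    """cp(X) = sum_x Pr[X=x]^2, for a distribution given as {value: prob}."""
    return sum(q * q for q in p.values())

def renyi_entropy(p):
    """H_2(X) = log2(1/cp(X)); equals log2(N) for the uniform
    distribution on N values, and drops as mass concentrates."""
    return -math.log2(collision_prob(p))

uniform = {x: 1 / 8 for x in range(8)}                   # 3 bits
skewed = {0: 1 / 2, **{x: 1 / 14 for x in range(1, 8)}}  # same support

print(renyi_entropy(uniform))            # 3.0
print(round(renyi_entropy(skewed), 3))   # 1.807
```

Note that H_2 is at most the Shannon entropy, so a block-source guarantee in Rényi entropy is the stronger assumption.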

13
An Approach
H truly random: for all distinct x_1, …, x_T, (H(x_1), …, H(x_T)) is uniform in [M]^T.
Goal: if H is a random universal hash function and X_1, X_2, …, X_T is a block source, then (H(X_1), …, H(X_T)) is close to uniform.
Randomness extractors!

14
Classic Extractor Results [BBR88,ILL89,CG85,Z90]
Leftover Hash Lemma: If H : [N] → [M] is a random universal hash function and X has Rényi entropy at least log M + 2 log(1/ε), then (H, H(X)) is ε-close to uniform.
Thm: If H : [N] → [M] is a random universal hash function and X_1, X_2, …, X_T is a block source with Rényi entropy at least log M + 2 log(T/ε) per block, then (H, H(X_1), …, H(X_T)) is ε-close to uniform.
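For concreteness, one classic universal family in the sense of [CW79] is h_{a,b}(x) = ((a·x + b) mod p) mod M with p prime and p larger than the universe; picking (a, b) at random selects a random H from the family. The prime and the experiment parameters below are assumptions for the demo, and the empirical check only approximates the ≈1/M pairwise collision guarantee.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than the universe [N]

def random_universal_hash(M, rng):
    """Sample h_{a,b}(x) = ((a*x + b) mod P) mod M from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % M

rng = random.Random(0)
h = random_universal_hash(2**10, rng)
print(h(12345))  # some value in [0, 1024)

# Universality: for fixed x != y, a collision happens with probability
# about 1/M over the random choice of the function.
trials = 20_000
coll = 0
for _ in range(trials):
    g = random_universal_hash(1024, rng)
    coll += (g(1) == g(2))
print(coll / trials)  # close to 1/1024, i.e. about 0.001
```

Such a function needs only two stored words and one multiplication per evaluation, which is why the "simple hash functions" of the title are practical.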

15
Sample Parameters
Network flows (IP addresses, ports, transport protocol): N = 2^104.
Number of items: T = 2^16.
Hash range (2 values per item): M = 2^32.
Entropy needed per item: 64 + 2 log(1/ε).
Can we do better?
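The per-item entropy figure follows from plugging these parameters into the classic bound on the previous slide; a quick arithmetic check, with an assumed target distance ε = 2^(−10):

```python
import math

log_M = 32        # hash range M = 2^32
log_T = 16        # T = 2^16 items
eps = 2 ** -10    # assumed target statistical distance; the slide leaves it free

# Classic bound from the previous slide: log M + 2*log(T/eps) per block
needed = log_M + 2 * (log_T + math.log2(1 / eps))
print(needed)  # 84.0, i.e. 64 + 2*log(1/eps) with log(1/eps) = 10
```

Requiring 84 of the 104 available bits to be entropy per flow is demanding, which motivates the improved bounds that follow.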

16
Improved Bounds I
Thm [CV08]: If H : [N] → [M] is a random universal hash function and X_1, X_2, …, X_T is a block source with Rényi entropy at least log M + log T + 2 log(1/ε) + O(1) per block, then (H, H(X_1), …, H(X_T)) is ε-close to uniform.
Tight up to the additive constant [CV08].

17
Improved Bounds II
Thm [MV08,CV08]: If H : [N] → [M] is a random universal hash function and X_1, X_2, …, X_T is a block source with Rényi entropy at least log M + log T + log(1/ε) + O(1) per block, then (H, H(X_1), …, H(X_T)) is ε-close to a distribution with collision probability O(1/M^T).
Tight up to the dependence on ε [CV08].

18
Proof Ideas: Upper Bounds
1. Bound the average conditional collision probabilities: cp(H(X_i) | H, H(X_1), …, H(X_{i−1})) ≤ 1/M + 1/2^k.
2a. Statistical closeness to uniform: inductively bound the Hellinger distance from uniform.
2b. Closeness to small collision probability: by Markov, (1/T) · Σ_i cp(H(X_i) | H = h, H(X_1) = y_1, …, H(X_{i−1}) = y_{i−1}) ≤ 1/M + 1/(ε·2^k) with probability 1 − ε over h, y_1, …, y_{i−1}.

19
Proof Ideas: Lower Bounds
Lower bound for randomness extractors [RT97]: if k is not large enough, then ∃ X of min-entropy k s.t. h(X) is far from uniform for most h.
Take X_1, X_2, …, X_T to be i.i.d. copies of X.
Show that the error accumulates, e.g. the statistical distance grows by a factor of Ω(√T) [R04,CV08].

20
Open Problems
Tightening the connection to practice:
– How to estimate the relevant entropy of data streams?
– Cryptographic hash functions (MD5, SHA-1)?
– Other data models?
The block-source data model:
– Other uses, implications?
