# Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. Michael Mitzenmacher, Salil Vadhan, and improvements with Kai-Min Chung.

The Question Traditional analyses of hashing-based algorithms & data structures assume a truly random hash function. In practice: simple (e.g. universal) hash functions perform just as well. Why?

Outline Three hashing applications The new model and results Proof ideas

Bloom Filters To approximately store $S = \{x_1, \dots, x_T\} \subseteq [N]$: Start with an array of $M = O(T)$ zeroes. Hash each item $k = O(1)$ times to $[M]$ using $h : [N] \to [M]^k$, and put a one in each location. To test whether $y \in S$: hash $y$ and accept if there are ones in all $k$ locations.
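The insert and test procedures above can be sketched in a few lines of Python. This is a minimal illustration, not the construction analyzed in the talk: the class name `BloomFilter`, the parameters `m` and `k`, and the use of salted SHA-256 digests to derive the $k$ positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of size m; each item sets (or checks) k positions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: int):
        # Assumption: positions come from salted SHA-256 digests. This stands
        # in for the map h : [N] -> [M]^k and is not taken from the slides.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: int) -> None:
        # Put a one in each of the k locations.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: int) -> bool:
        # Accept only if all k locations hold a one.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical sizes chosen only for illustration.
bf = BloomFilter(m=1024, k=5)
for x in range(100):
    bf.add(x)
```

Stored items are always accepted; items outside the set may be accepted by accident, which is the error the analysis bounds.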

Bloom Filter Analysis Thm [B70]: $\forall S$, $\forall y \notin S$, if $h$ is a truly random hash function, $\Pr_h[\text{accept } y] = 2^{-(\ln 2) \cdot M/T} + o(1)$ for an optimal choice of $k$.
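To see where the exponent comes from, the sketch below evaluates the standard approximation $(1 - e^{-kT/M})^k$ for the false-positive rate and compares it with the closed form at $k = (\ln 2) \cdot M/T$. The approximation formula and the sample values of `m` and `t` are assumptions brought in for illustration; the slide states only the final bound.

```python
import math


def optimal_k(m: float, t: float) -> float:
    """Real-valued k that minimizes the approximate false-positive rate."""
    return math.log(2) * m / t


def false_positive_rate(m: float, t: float, k: float) -> float:
    """Standard approximation (1 - e^{-kT/M})^k under a truly random hash."""
    return (1.0 - math.exp(-k * t / m)) ** k


# Hypothetical sizes: 8 bits of array per stored item.
m, t = 8000, 1000
k = optimal_k(m, t)
rate_at_opt = false_positive_rate(m, t, k)
closed_form = 2.0 ** (-math.log(2) * m / t)
```

At the optimal $k$ each bit is one with probability about $1/2$, so the rate is $2^{-k}$, which matches the exponent in the theorem.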

Balanced Allocations Hashing T items into T buckets –What is the maximum number of items, or load, of any bucket? –Assume buckets chosen independently & uniformly at random. Well-known result: $\Theta(\log T / \log\log T)$ maximum load w.h.p.
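A small simulation of this setting, assuming Python's `random` module as a stand-in for a truly random hash function. The function name `max_load_one_choice` and the value of `t` are hypothetical.

```python
import random
from collections import Counter


def max_load_one_choice(t: int, rng: random.Random) -> int:
    """Throw t items into t buckets uniformly at random; return the largest load."""
    loads = Counter(rng.randrange(t) for _ in range(t))
    return max(loads.values())


rng = random.Random(0)  # fixed seed so runs are repeatable
observed = max_load_one_choice(10_000, rng)
```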

Power of Two Choices Suppose each ball can pick two bins independently and uniformly and choose the bin with less load. Thm [ABKU94]: maximum load $\log\log n / \log 2 + \Theta(1)$ w.h.p.
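The two-choice rule can be simulated as follows. This is a sketch under stated assumptions: pseudo-random choices replace the hash functions, ties go to the first bin drawn (the slide does not specify tie-breaking), and the two bins are drawn with replacement.

```python
import random


def max_load_two_choices(n: int, rng: random.Random) -> int:
    """Place n balls into n bins, each ball going to the lighter of two random bins."""
    loads = [0] * n
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        # Assumption: ties are broken in favor of the first bin drawn.
        target = i if loads[i] <= loads[j] else j
        loads[target] += 1
    return max(loads)


rng = random.Random(0)
observed = max_load_two_choices(10_000, rng)
```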

Linear Probing Hash elements into an array of length $M$. If $h(x)$ is already full, try $h(x)+1, h(x)+2, \dots$ until an empty spot is found, and place $x$ there. Thm [K63]: Expected insertion time for the $T$th item is $1/(1 - T/M)^2 + o(1)$.
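The probing rule can be written out directly. In this sketch the starting slot is drawn from Python's `random` module to imitate a truly random hash function, and probing wraps around the end of the array; both are assumptions, as are the function names and table sizes.

```python
import random


def insert_linear_probing(table: list, start: int) -> int:
    """Place an item in the first free slot at or after `start`.

    Returns the number of slots examined. Wrapping around the end of the
    array is an assumption; the table must have at least one free slot.
    """
    m = len(table)
    slot = start % m
    probes = 1
    while table[slot]:
        slot = (slot + 1) % m
        probes += 1
    table[slot] = True
    return probes


def average_probes(m: int, t: int, rng: random.Random) -> float:
    """Insert t items with uniformly random start slots; average the probe counts."""
    table = [False] * m
    return sum(insert_linear_probing(table, rng.randrange(m)) for _ in range(t)) / t


rng = random.Random(0)
avg = average_probes(m=1000, t=500, rng=rng)  # load factor 1/2
```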

Explicit Hash Functions Can sometimes analyze for explicit (e.g. universal [CW79] ) hash functions, but performance somewhat worse, and/or hash functions complex/inefficient. Noted since 1970s that simple hash functions match idealized performance on real data.

Simple Hash Functions Don't Always Work $\exists$ pairwise independent hash families & inputs s.t. Linear Probing has $\Omega(\log T)$ insertion time [PPR07]. $\exists$ $k$-wise independent hash families & inputs s.t. Bloom Filter error prob. higher than ideal [MV08]. Open for Balanced Allocations. Worst case does not match practice.

Average-Case Analysis? Data uniform & independent in $[N]$. –Not a good model for real data. –Trivializes hashing. Need intermediate model between worst-case and average-case analysis.

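Our Model: Block Sources [CG85] Data is a finite stream, modeled by a sequence of random variables $X_1, X_2, \dots, X_T \in [N]$. Each stream element has some $k$ bits of (Rényi) entropy, conditioned on previous elements: $\mathrm{cp}(X_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}) \le 2^{-k}$ for all $i$ and all $x_1, \dots, x_{i-1}$, where $\mathrm{cp}(X) = \sum_x \Pr[X = x]^2$. Similar spirit to semi-random graphs [BS95], smoothed analysis [ST01].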
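As a concrete reading of the definition, the sketch below computes the collision probability of a finite distribution and the corresponding Rényi entropy $\log_2(1/\mathrm{cp}(X))$. The function names and the two example distributions are hypothetical.

```python
import math


def collision_probability(dist: dict) -> float:
    """cp(X) = sum over x of Pr[X = x]^2, for a distribution given as {value: prob}."""
    return sum(p * p for p in dist.values())


def renyi_entropy(dist: dict) -> float:
    """Renyi (order-2) entropy in bits: log2(1 / cp(X))."""
    return -math.log2(collision_probability(dist))


uniform8 = {x: 1 / 8 for x in range(8)}   # cp = 1/8, so 3 bits
skewed = {0: 0.5, 1: 0.25, 2: 0.25}       # cp = 0.375, so fewer than 2 bits
```

A block source with $k$ bits per block requires this quantity to be at least $k$ for every element, for every possible history of earlier elements.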

An Approach $H$ truly random: for all distinct $x_1, \dots, x_T$, $(H(x_1), \dots, H(x_T))$ is uniform in $[M]^T$. Goal: if $H$ is a random universal hash function and $X_1, X_2, \dots, X_T$ is a block source, then $(H(X_1), \dots, H(X_T))$ is close to uniform. Randomness extractors!

Classic Extractor Results [BBR88,ILL89,CG85,Z90] Leftover Hash Lemma: If $H : [N] \to [M]$ is a random universal hash function and $X$ has Rényi entropy at least $\log M + 2\log(1/\varepsilon)$, then $(H, H(X))$ is $\varepsilon$-close to uniform. Thm: If $H : [N] \to [M]$ is a random universal hash function and $X_1, X_2, \dots, X_T$ is a block source with Rényi entropy at least $\log M + 2\log(T/\varepsilon)$ per block, then $(H, H(X_1), \dots, H(X_T))$ is $\varepsilon$-close to uniform.
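The single-block lemma can be checked exactly on a toy instance. The sketch below assumes one particular universal family, $h_{a,b}(x) = ((ax + b) \bmod p) \bmod M$ with $a \ne 0$, in the style of [CW79]; the slides do not fix a family. The prime `p = 101`, the range `m = 4`, and the 64-element support are hypothetical choices small enough to enumerate every hash function.

```python
import math


def distance_from_uniform(p: int, m: int, support: list) -> float:
    """Exact statistical distance of (H, H(X)) from uniform.

    X is uniform on `support` (values in [0, p)); H ranges over all
    h_{a,b}(x) = ((a*x + b) % p) % m with 1 <= a < p and 0 <= b < p.
    Because H is uniform and independent of X, the joint distance equals
    the average over h of the distance of h(X) from uniform on [m].
    """
    total = 0.0
    count = 0
    for a in range(1, p):
        for b in range(p):
            hist = [0] * m
            for x in support:
                hist[((a * x + b) % p) % m] += 1
            total += 0.5 * sum(abs(c / len(support) - 1.0 / m) for c in hist)
            count += 1
    return total / count


p, m = 101, 4
support = list(range(64))                # X has log2(64) = 6 bits of Renyi entropy
eps = math.sqrt(m / len(support))        # from 6 = log2(m) + 2*log2(1/eps)
dist = distance_from_uniform(p, m, support)
```

Here `eps` is the error the lemma permits for 6 bits of entropy and a 2-bit output, and `dist` is the exact distance for this toy family.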

Sample Parameters Network flows (IP addresses, ports, transport protocol): $N = 2^{104}$. Number of items: $T = 2^{16}$. Hash range (2 values per item): $M = 2^{32}$. Entropy needed per item: $\log M + 2\log(T/\varepsilon) = 64 + 2\log(1/\varepsilon)$. Can we do better?

Improved Bounds I Thm [CV08]: If $H : [N] \to [M]$ is a random universal hash function and $X_1, X_2, \dots, X_T$ is a block source with Rényi entropy at least $\log M + \log T + 2\log(1/\varepsilon) + O(1)$ per block, then $(H, H(X_1), \dots, H(X_T))$ is $\varepsilon$-close to uniform. Tight up to additive constant [CV08].

Improved Bounds II Thm [MV08,CV08]: If $H : [N] \to [M]$ is a random universal hash function and $X_1, X_2, \dots, X_T$ is a block source with Rényi entropy at least $\log M + \log T + \log(1/\varepsilon) + O(1)$ per block, then $(H, H(X_1), \dots, H(X_T))$ is $\varepsilon$-close to a distribution with collision probability $O(1/M^T)$. Tight up to dependence on $\varepsilon$ [CV08].
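Plugging the sample parameters ($\log M = 32$, $\log T = 16$) into the three per-block entropy requirements shows the size of the savings. In this sketch the $O(1)$ terms are dropped, which is an assumption since the slides do not give the constants, and $\varepsilon = 2^{-10}$ is a hypothetical error target.

```python
import math


def entropy_classic(log_m: float, log_t: float, eps: float) -> float:
    """Classic block-source bound: log M + 2 log(T / eps)."""
    return log_m + 2 * (log_t + math.log2(1 / eps))


def entropy_improved_uniform(log_m: float, log_t: float, eps: float) -> float:
    """Improved Bounds I: log M + log T + 2 log(1 / eps), O(1) term omitted."""
    return log_m + log_t + 2 * math.log2(1 / eps)


def entropy_improved_collision(log_m: float, log_t: float, eps: float) -> float:
    """Improved Bounds II: log M + log T + log(1 / eps), O(1) term omitted."""
    return log_m + log_t + math.log2(1 / eps)


log_m, log_t, eps = 32, 16, 2.0 ** -10
classic = entropy_classic(log_m, log_t, eps)               # 64 + 2*10
uniform = entropy_improved_uniform(log_m, log_t, eps)      # 48 + 2*10
collision = entropy_improved_collision(log_m, log_t, eps)  # 48 + 10
```

With these values the requirement falls from 84 bits per item to 68 and then 58, before the omitted constants are added back.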

Proof Ideas: Upper Bounds 1. Bound average conditional collision probs: $\mathrm{cp}(H(X_i) \mid H, H(X_1), \dots, H(X_{i-1})) \le 1/M + 1/2^k$. 2a. Statistical closeness to uniform: inductively bound Hellinger distance from uniform. 2b. Close to small collision prob: by Markov, get $(1/T) \cdot \sum_i \mathrm{cp}(H(X_i) \mid H = h, H(X_1) = y_1, \dots, H(X_{i-1}) = y_{i-1}) \le 1/M + 1/(\varepsilon 2^k)$ w.p. $1 - \varepsilon$ over $h, y_1, \dots, y_{i-1}$.

Proof Ideas: Lower Bounds Lower bound for randomness extractors [RT97]: if $k$ is not large enough, then $\exists X$ of min-entropy $k$ s.t. $h(X)$ is far from uniform for most $h$. Take $X_1, X_2, \dots, X_T$ to be iid copies of $X$. Show that error accumulates, e.g. statistical distance grows by a factor of $\Omega(\sqrt{T})$ [R04,CV08].

Open Problems Tightening connection to practice. –How to estimate relevant entropy of data streams? –Cryptographic hash functions (MD5,SHA-1)? –Other data models? Block source data model. –Other uses, implications?
