
An Optimal Algorithm for the Distinct Elements Problem. Daniel Kane, Jelani Nelson, David Woodruff. PODS, 2010.



Presentation transcript:

1 An Optimal Algorithm for the Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010

2 Problem Description
Given a long stream of values from a universe of size n:
– each value can occur any number of times
– count the number F0 of distinct values
We see values one at a time, in one pass over the stream; it is too expensive to store the set of distinct values.
Algorithms should:
– use a small amount of memory
– have fast update time (per-value processing time)
– have fast recovery time (time to report the answer)
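As a point of comparison, the naive exact approach can be sketched as follows (a hypothetical baseline of ours, not the paper's algorithm). It is correct but stores every distinct value, which is exactly the Θ(F0)-memory cost the streaming algorithms on the later slides avoid:

```python
# Hypothetical exact baseline (not the paper's algorithm): one pass with a set.
def exact_f0(stream):
    seen = set()           # stores every distinct value seen so far: Theta(F0) memory
    for v in stream:
        seen.add(v)        # O(1) expected time per value
    return len(seen)

# The example stream from these slides has 10 distinct values.
print(exact_f0([3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 338, 32, 4]))  # prints 10
```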

3 Randomized Approximation Algorithms
Stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 338, 32, 4, …
Consider algorithms that store a subset S of distinct values, e.g., S = {3, 9, 32, 265}.
Main drawback: S needs to be large to know whether the next value is a new distinct value.
Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F0 exactly must use ≈ F0 bits of memory.
Hence, algorithms must be randomized and settle for an approximate solution: output an estimate F̃ ∈ [(1 − ε)F0, (1 + ε)F0] with good probability.

4 Problem History
Long sequence of work on the problem; Flajolet and Martin introduced it, FOCS 1983.
Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, W
Previous best algorithm: O(ε⁻² log log n + log n) bits of memory and O(ε⁻²) update and reporting time.
Known lower bound on the memory: Ω(ε⁻² + log n).
Our result: optimal O(ε⁻² + log n) bits of memory and O(1) update and reporting time.

5 Previous Approaches
Suppose we randomly hash the F0 distinct values into a hash table of 1/ε² buckets and keep track of the number C of non-empty buckets.
If F0 < 1/ε², there is a way to estimate F0 up to a factor (1 ± ε) from C.
Problem: if F0 ≫ 1/ε², then with high probability every bucket contains a value, so C gives no information.
Solution: randomly choose S_{log n} ⊆ S_{log n − 1} ⊆ S_{log n − 2} ⊆ … ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i.
Stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 338, 32, 4, …
With S_i = {1, 3, 7, 9, 265}, the i-th substream is: 3, 265, 3, 9, 7, 9, 3, …
Run the hashing procedure on each substream. There is some i for which the number of distinct values in the i-th substream is ≈ 1/ε², and the hashing procedure on that substream works.
Problem: it takes (1/ε²)·log n bits of memory to keep track of this information.
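The bucket-occupancy step on this slide can be sketched as follows (an illustrative toy, not the paper's algorithm; the name `estimate_f0` and the salted use of Python's built-in `hash` are our own choices). With B = 1/ε² buckets, E[C] = B·(1 − (1 − 1/B)^F0), which can be inverted to estimate F0 whenever the table is not saturated:

```python
import math
import random

def estimate_f0(stream, eps=0.1, seed=0):
    """Illustrative sketch of the occupancy idea (not the full algorithm):
    hash values into B = 1/eps^2 buckets, count non-empty buckets C, and
    invert E[C] = B * (1 - (1 - 1/B)**F0) to estimate F0."""
    B = round(1 / eps ** 2)
    salt = random.Random(seed).getrandbits(64)
    occupied = [False] * B
    for v in stream:
        occupied[hash((salt, v)) % B] = True
    C = sum(occupied)
    if C == B:   # saturated: F0 >> B and C carries no information (the "Problem" above)
        return None
    # Solve C = B * (1 - (1 - 1/B)**F0) for F0.
    return math.log(1 - C / B) / math.log(1 - 1 / B)
```

The subsampling sets S_i are what rescue the saturated case: some level i sees only ≈ F0/2^i ≈ 1/ε² distinct values, and the estimate at that level can be scaled back up by 2^i.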

6 Our Techniques
Observation:
– Have 1/ε² global buckets.
– In each bucket, keep track of the largest index i for which S_i contains a value hashed to that bucket.
– This gives O((1/ε²) log log n) bits of memory.
New ideas:
– We can show that, with high probability, at every point in the stream most buckets contain roughly the same index.
– So we can just keep track of the offsets from this common index.
– We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to update offsets efficiently.
– Occasionally we need to decrement all offsets; this work can be spread across multiple updates.
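A minimal sketch of the observation above (the class name, hashing scheme, and level rule are our own illustrative choices, not the paper's code). Membership v ∈ S_i is simulated by the position of the lowest set bit of a hash of v, so P(v ∈ S_i) ≈ 2⁻ⁱ; each bucket remembers only the largest level it has seen, and `packed_offsets` hints at the offset-packing trick, storing small offsets from a common base instead of full O(log log n)-bit counters:

```python
import random

class MaxLevelBuckets:
    """Sketch of the O(eps^-2 log log n) starting point on this slide."""
    def __init__(self, eps=0.1, seed=0):
        self.B = round(1 / eps ** 2)
        self.level = [-1] * self.B                  # -1 marks an empty bucket
        r = random.Random(seed)
        self.salt_level = r.getrandbits(64)
        self.salt_bucket = r.getrandbits(64)

    def _level(self, v):
        # Lowest set bit of a 64-bit hash: P(level >= i) = 2**-i, which
        # simulates membership v in S_i with |S_i| ~ n / 2**i.
        h = hash((self.salt_level, v)) & (2 ** 64 - 1) or 1
        return (h & -h).bit_length() - 1

    def update(self, v):
        b = hash((self.salt_bucket, v)) % self.B
        self.level[b] = max(self.level[b], self._level(v))

    def packed_offsets(self, width=3):
        # The memory trick: most levels cluster near a common base, so store
        # only small (`width`-bit) offsets below it rather than full levels.
        base = max(self.level)
        return base, [min(base - l, 2 ** width - 1) for l in self.level]
```

The real algorithm additionally packs these offsets into machine words and amortizes the occasional global decrement across updates; this sketch only shows why small offsets suffice.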

