Published by Thomas Baldwin. Modified over 3 years ago.

1
An Optimal Algorithm for the Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010

2
Problem Description
Given a long stream of values from a universe of size n
–each value can occur any number of times
–count the number F_0 of distinct values
See values one at a time
One pass over the stream
Too expensive to store the set of distinct values
Algorithms should:
–Use a small amount of memory
–Have fast update time (per-value processing time)
–Have fast reporting time (time to report the answer)
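
The update/report interface described above can be illustrated with the exact baseline that the slide rules out as too expensive. This is a minimal sketch, not part of the original talk; the class name `ExactDistinctCounter` and its method names are hypothetical. Its memory grows linearly with the number of distinct values, which is the cost the streaming algorithms aim to avoid.

```python
class ExactDistinctCounter:
    """Exact baseline: stores every distinct value seen so far (hypothetical name)."""

    def __init__(self):
        self.seen = set()

    def update(self, value):
        # Per-value processing: expected O(1) time, but memory grows with F_0.
        self.seen.add(value)

    def report(self):
        # Reporting: the exact number of distinct values seen so far.
        return len(self.seen)


counter = ExactDistinctCounter()
for v in [3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3]:
    counter.update(v)
print(counter.report())  # 8
```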

3
Randomized Approximation Algorithms
Stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, , 338, 32, 4, …
Consider algorithms that store a subset S of distinct values
–E.g., S = {3, 9, 32, 265}
–Main drawback is that S needs to be large to know if the next value is a new distinct value
Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F_0 exactly must use ≈ F_0 memory
Hence, algorithms must be randomized and settle for an approximate solution: output an estimate F ∈ [(1-ε)F_0, (1+ε)F_0] with good probability

4
Problem History
Long sequence of work on the problem
–Flajolet and Martin introduced the problem, FOCS 1983
–Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, W
Previous best algorithm: O(ε^{-2} log log n + log n) bits of memory and O(ε^{-2}) update and reporting time
Known lower bound on the memory: Ω(ε^{-2} + log n)
Our result: optimal O(ε^{-2} + log n) bits of memory and O(1) update and reporting time

5
Previous Approaches
Suppose we randomly hash F_0 values into a hash table of 1/ε^2 buckets and keep track of the number C of non-empty buckets
–If F_0 < 1/ε^2, there is a way to estimate F_0 up to (1 ± ε) from C
–Problem: if F_0 ≫ 1/ε^2, with high probability, every bucket contains a value, so there is no information
Solution: randomly choose S_{log n} ⊆ S_{log n - 1} ⊆ S_{log n - 2} ⊆ … ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i
–stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, , 338, 32, 4, …
–S_i = {1, 3, 7, 9, 265}
–i-th substream: 3, 265, 3, 9, 7, 9, 3, …
Run the hashing procedure on each substream
–There is an i for which the # of distinct values in the i-th substream ≈ 1/ε^2
–The hashing procedure on the i-th substream works
Problem: It takes (1/ε^2) log n bits of memory to keep track of this information
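
The subsampling scheme above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class `SubsampledBuckets`, the salted BLAKE2 hash standing in for a random hash function, the rule of picking the first level whose bitmap is at most half full, and the balls-in-bins inversion ln(1 - C/K) / ln(1 - 1/K) are all choices made here for illustration. The slide only states that "there is a way" to estimate F_0 from C. The sketch also adds a level 0 that keeps every value, and it is meant for reasonably large K.

```python
import hashlib
import math


def _hash64(value, salt):
    """Deterministic 64-bit hash; a stand-in for a truly random hash function."""
    digest = hashlib.blake2b(f"{salt}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class SubsampledBuckets:
    """One K-bucket occupancy bitmap per subsampling level (hypothetical sketch)."""

    def __init__(self, eps, log_n):
        self.k = max(2, round(1 / eps ** 2))  # K = 1/eps^2 buckets
        self.levels = log_n + 1               # level 0 keeps every value
        self.occupied = [[False] * self.k for _ in range(self.levels)]

    def _level(self, value):
        # Count trailing zero bits of a hash: a value reaches level i with
        # probability 2^-i, so S_i = {x : level(x) >= i} are nested sets.
        h = _hash64(value, "level")
        lvl = 0
        while lvl < self.levels - 1 and (h >> lvl) & 1 == 0:
            lvl += 1
        return lvl

    def update(self, value):
        bucket = _hash64(value, "bucket") % self.k
        # The value belongs to substreams 0..level, so mark its bucket in each.
        for i in range(self._level(value) + 1):
            self.occupied[i][bucket] = True

    def estimate(self):
        for i in range(self.levels):
            c = sum(self.occupied[i])
            if c <= self.k / 2 or i == self.levels - 1:
                c = min(c, self.k - 1)  # guard against log(0) on a full bitmap
                if c == 0:
                    return 0.0
                # Invert E[C] = K * (1 - (1 - 1/K)^F) for the substream count F,
                # then scale by 2^i to undo the subsampling (assumed estimator).
                per_level = math.log(1 - c / self.k) / math.log(1 - 1 / self.k)
                return (2 ** i) * per_level


sketch = SubsampledBuckets(eps=0.02, log_n=20)
for x in range(20000):
    sketch.update(x)
    sketch.update(x)  # repeats do not change the sketch
print(round(sketch.estimate()))
```

The memory here is K bits for each of the log n + 1 levels, which matches the (1/ε^2) log n cost that the slide identifies as the problem with this approach.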

6
Our Techniques
Observation:
- Have 1/ε^2 global buckets
- In each bucket we keep track of the largest index i for which S_i contains a value hashed to the bucket
- This gives O((1/ε^2) log log n) bits of memory
New Ideas:
- Can show that, with high probability, at every point in the stream, most buckets contain roughly the same index
- We can just keep track of the offsets from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to update offsets efficiently
- Occasionally we need to decrement all offsets. We can spread the work across multiple updates
