 # The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.

## Presentation on theme: "The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden."— Presentation transcript:

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden

Problem Description Given a data stream of n insertions of records, count the number F 0 of distinct records One pass over the data stream Algorithms must use small amount of memory and have fast update time –too expensive to store set of distinct records –implies algorithms must be randomized and must settle for an approximate solution: output F 2 [(1- ² )F 0, (1+ ² )F 0 ] with constant probability

The Data Stream How is the data in the stream organized? –Usually, assume worst-case ordered In this case, £ (1/ ² 2 ) bits are necessary and sufficient to estimate F 0 –As the quality of approximation improves to say, ² = 1%, the quadratic dependence is a major drawback –Sometimes a random-ordering can be assumed Suppose we are mining salaries, and they are ordered alphabetically by surname. If there is no correlation between salaries and surname, then the stream of salaries is ordered randomly The backing sample architecture assumes data randomly ordered by design (Gibbons, Matias, Poosala) This model is referred to as the Random-Order Model (Guha, McGregor) Unfortunately, even in this model, we still need (1/ ² 2 ) bits to estimate F 0 (Chakrabarti, Cormode, McGregor). Intuitively this is because the data is still worst-case.

Random Data Model In an attempt to bypass the (1/ ² 2 ) bound, we propose to study the case when the data comes from an underlying distribution. –Problem 3 of Muthukrishnans book: Provide inproved estimates for L p sums including distinct element estimation if input stream has statistical properties such as being Zipfian. There is a distribution defined by probabilities p i, 1 · i · m, p i = 1. The next item in the stream is chosen independently of previous items, and is i with probability p i. We call this the Random-Data Model. The Random-Data Model is contained in the Random-Order Model.

Random Data Model This model for F 0 was implicitly studied before –by Motwani and Vassilvitskii when the distribution is Zipfian. This distribution is useful for estimating WebGraph statistics –sampling-based algorithms used in practice impose distributional assumptions, without which they have poor performance (Charikar et al) –Generalized Inverse Gaussian Poisson (GIGP) model studies sampling- based estimators when distribution is uniform and Zipfian –by Guha and McGregor for estimating the density function of an unknown distribution, which is useful in learning theory

Further Restriction We focus on the case when each probability p i = 1/d for an unknown value of d (so uniform over a subset of [m]) Captures the setting of sampling with replacement a set of unknown cardinality For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting Distribution fairly robust in the sense that other distributions with a few heavy items and remaining items that are approximately uniform have the same properties above

Our Upper Bound 1-pass algorithm with an expected O(d(log 1/ ² )/(n ² 2 ) + log m) bits of space, whenever d ¸ 1/ ² 2 and d · n. The per-item processing time is constant. Recall the distribution is uniform over a d-element subset of [m], and we see n samples from it, so this is a typical setting of parameters. Notice for n even slightly larger than d, the algorithm does much better than the (1/ ² 2 ) lower bound in other data stream models. One can show for every combination of known algorithms with different space/time tradeoffs for F 0 in the adversarial model, our algorithm is either better in space or in time.

Our Lower Bound Our main technical result is that if n and d are £ (1/ ² 2 ), then even estimating F 0 in the random data model requires (1/ ² 2 ) space Lower bound subsumes previous lower bounds, showing that even for a natural (random) choice of data, the problem is hard Our choice of distribution for showing the lower bound was used in subsequent work by Chakrabarti and Brody, which turned out to be useful for establishing an (1/ ² 2 ) lower bound for constant pass algorithms for estimating F 0

Techniques: Upper Bound Very simple observation: –Since d · n, each item should have frequency n/d in the stream. –If n/d is at least (1/ ² 2 ), we can just compute n and the frequency of the first item in the stream to get a (1+ ² )-approximation to d –Using a balls-and-bins occupancy bound of Kamath et al, a good estimation of d implies a good estimation to F 0 If n/d is less than 1/ ² 2, we could instead store the first O(1/ ² 2 ) items (hashed appropriately for small space), treat them as a set S, and count the number of items in the remaining part of the stream that land in S –Correct, but unnecessary if d is much less than n.

Techniques: Upper Bound Instead record the first item x in the stream, and find the second position i of x in the stream. Position i should occur at roughly the d-th position in the stream, so i provides a constant factor approximation to d Since n = (d), position i should be in the first half of the stream with large constant probability. Now store the first i/(n ² 2 ) distinct stream elements in the second half of stream, treat them as a set S, and count the remaining items in the stream that occur in S. –Good enough for (1+ ² )-approximation –Space is O(log n + (i log m)/(n ² 2 )) ¼ O(log n + (d log m)/(n ² 2 ))

Techniques: Upper Bound Space is O(log n + (d log m)/(n ² 2 )), but we can do better. For each j in [m], sample j independently with probability 1/(i ² 2 ). In expectation new distribution is now uniform over d/(i ² 2 ) items. If j is sampled, say j survives. Go back to the previous step: store the first i/(n ² 2 ) distinct surviving stream elements in the second half of stream, treat them as a set S, and count the remaining items in the stream that occur in S. –Since only £ (1/ ² 2 ) items survive, can store S with only (i log 1/ ² )/(n ² 2 ) bits by hashing item IDs down to a range of size, say, 1/ ² 5 –We estimate the distributions support size in the sub-sampled stream, which is roughly d/(i ² 2 ). We can get a (1+ ² )-approximation to this quantity provided it is at least (1/ ² 2 ), which it is with high probability. Then scale by i ² 2 to estimate d, and thus F 0 by previous reasoning. –Constant update time

Techniques: Lower Bound Consider the uniform distribution ¹ on [d] = {1, 2, …, d}. d and n are £ ( 1/ ² 2 ) Consider a stream of n samples from ¹ where we choose n so that 1-(1-1/d) n/2 in [1/3, 2/3] Let X be the characteristic vector of the first n/2 stream samples on the universe [d]. So X i = 1 if and only if item i occurs in these samples. Let Y be the characteristic vector of the second n/2 stream samples on the universe [d]. So Y i = 1 if and only if item i occurs in these samples. Let wt(X), wt(Y) be the number of ones in vectors X, Y. We consider a communication game. Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know if Delta(X,Y) ¸ wt(X) + wt(Y)-2wt(X)wt(Y)/d Alice and Bob should solve this problem with large probability, where the probability is over the choice of X and Y, and their coin tosses (which can be fixed).

Techniques: Lower Bound Strategy (1) show the space complexity S of the streaming algorithm is at least the one-way communication complexity CC of the game (2) lower bound CC Theorem: S ¸ CC Proof: Alice runs the streaming algorithm on a random stream a X generated by her characteristic vector X. Alice transmits the state to Bob Bob continues the computation of the streaming algorithm on a random stream a Y with characteristic vector Y At the end the algorithm estimates F 0 of a stream whose elements are in the support of X or Y (or both) Notice that the two halves of the stream are independent in the random data model, so the stream generated has the right distribution

Techniques: Lower Bound We show the estimate of F 0 can solve the communication game Remember, the communication game is to decide whether Delta(X,Y) ¸ wt(X) + wt(Y)-2wt(X)wt(Y)/d Note the quantity on the right is the expected value of Delta(X,Y) Some Lemmas (1) Pr[d/4 · wt(X), wt(Y) · 3d/4] = 1-o(1) (2) Consider the variable X, distributed as X, but conditioned on wt(X) = k. The variable Y is distributed as Y, but conditioned on wt(Y) = r. –X and Y are uniform over k and r bit strings, respectively –Choose k, r to be integers in [d/4, 3d/4] Then for any constant δ > 0, there is a constant ® > 0, so that Pr[| ¢ (X,Y) – E[ ¢ (X, Y)]| ¸ ® d 1/2 ] ¸ 1- δ Follows from standard deviation of hypergeometric distribution

Techniques: Lower Bound ¢ (X,Y) = 2F 0 (a X a Y )-wt(X)-wt(Y) Note that Bob has wt(X) and wt(Y), so he can compute ¿ = wt(X) + wt(Y) – wt(X)wt(Y)/d If F is the output of the streaming algorithm, Bob simply computes 2F-wt(X)-wt(Y) and checks if it is greater than ¿ F 2 [(1- ² )F 0, (1+ ² )F 0 ] with large constant probability 2F-wt(X)-wt(Y) 2 [2F 0 -wt(X)-wt(Y)-2 ² F 0,, 2F 0 -wt(X)-wt(Y)+2 ² F 0 ] = [ ¢ (X,Y)-2 ² F 0, ¢ (X,Y)+ 2 ² F 0 ] = [ ¢ (X,Y)- £ ( ² d), ¢ (X,Y) + £ ( ² d)] = [ ¢ (X,Y) - £ (d 1/2 ), ¢ (X,Y) + £ (d 1/2 )]

Techniques: Lower Bound By our lemmas, and using that E[ ¢ (X, Y)] = ¿, with large constant probability either ¢ (X,Y) > ¿ + ® d 1/2 or ¢ (X,Y) < ¿ - ® d 1/2 By previous slide, Bobs value 2F-wt(X)-wt(Y) is in the range [ ¢ (X,Y) - £ (d 1/2 ), ¢ (X,Y) + £ (d 1/2 )] If ¢ (X,Y) > ¿ + ® d 1/2, then 2F-wt(X)-wt(Y) > ¿ + ® d 1/2 - £ (d 1/2 ) = ¿ If ¢ (X,Y) < ¿ - ® d 1/2, then 2F-wt(X)-wt(Y) < ¿ - ® d 1/2 + £ (d 1/2 ) = ¿ So Bob can use the output of the streaming algorithm to solve the communication problem, so S ¸ CC

Lower Bounding CC Communication game: Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know if Delta(X,Y) ¸ wt(X) + wt(Y)-2wt(X)wt(Y)/d Here X,Y have the distribution of being independent characteristic vectors of a stream on n/2 samples, where 1-(1-1/d) n/2 2 [1/3, 2/3] With large probability, Pr[d/4 · wt(X), wt(Y) · 3d/4] = 1-o(1) By averaging, a correct protocol is also correct with large probability for fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string of Hamming weight i, and Y a random string of Hamming weight j.

Lower Bounding CC 01000111 11110000 …… 10001011 00110011 10101100 X such that wt(X) = i Y such that wt(Y) = j With large probability, the message M that Alice sends has the property that many different X cause Alice to send M. Say such an M is large. Since i, j 2 [d/4, 3d/4], one can show that the fraction of 1s in each row is in [1/2-o(1),1/2+o(1)] Recall that an entry is 1 if and only if ¢ (X,Y) ¸ i + j – 2i ¢ j/d We show that for any large M, the fraction of 1s in most of the columns is in, say, [1/10, 9/10]. Then Bob doesnt know what to output. Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2-o(1), 1/2+o(1)].

Lower Bounding CC 11110000 11110000 …… 10001011 00110011 10101100 Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2-o(1), 1/2+o(1)]. But the variance could be huge, e.g., the matrix may look like this, in which case Bob can easily output the answer.

Lower Bounding CC Can show this doesnt happen by the second-moment method. Let V y be the fraction of 1s in the column indexed by y. Let S be the set of x which cause Alice to send a large message M. Consider random Y. Let C u = 1 if and only if ¢ (u,Y) > ¿. So V = (1/|S|) sum u in S C u Var[V] = (1/|S| 2 ) sum u,v in S (E[C u C v ] – E[C u ]E[C v ]). Notice that E[C u ]E[C v ] 2 [1/4-o(1), 1/4+o(1)] while E[C u C v ] = Pr[ ¢ (u,Y) > ¿ | ¢ (v,Y) > ¿ ] ¢ ½. Since S is a large set, most pairs u, v 2 S have Hamming distance in [d/5, 4d/5]. –A technical lemma shows that Pr[ ¢ (u,Y) > ¿ | ¢ (v,Y) > ¿ ] is a constant strictly less than 1. Hence, E[C u C v ] is a constant strictly less than 1/2, and since this happens for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4.

Lower Bounding CC Fraction of 1s in a random column is just V = (1/|S|) sum u in S C u Let · be a small positive constant. By Chebyshevs inequality, Pr[|V-1/2| > 1/2- · ] ½- · -o(1)] · Var[V]/(1/2- · ) 2 + o(1) But we showed Var[V] is a constant strictly less than ¼, so this probability is a constant strictly less than 1 for small enough ·. Hence, for a random column Y, the fraction of 1s is in [ ·, 1- · ] with constant probability. It follows that with some constant probability Bob outputs the wrong answer. Hence, most messages of Alice must be small, so one can show that there must be 2 d) of them, so that communication is (d) = (1/ ² 2 ).

Conclusions Introduced random data model, and studied F 0 -estimation under distributions uniform over a subset of d items For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting Are there other natural distributions that admit more space-efficient algorithms in this model? Are there other useful ways of bypassing the (1/ ² 2 ) lower bound?