The Average Case Complexity of Counting Distinct Elements
David Woodruff, IBM Almaden

Problem Description
Given a data stream of n insertions of records, count the number F_0 of distinct records.
One pass over the data stream.
Algorithms must use a small amount of memory and have fast update time:
– it is too expensive to store the set of distinct records
– this forces algorithms to be randomized and to settle for an approximate solution: output F ∈ [(1−ε)F_0, (1+ε)F_0] with constant probability
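For concreteness, here is the exact approach the slide rules out (my illustration, not part of the talk): remembering every distinct record costs Ω(F_0) space, which is exactly what a one-pass, small-memory algorithm cannot afford.

```python
# Illustration only: exact F0 by remembering every distinct record.
# The set `seen` grows to F0 entries, i.e. Omega(F0) space -- too expensive.
def exact_f0(stream):
    seen = set()
    for record in stream:
        seen.add(record)
    return len(seen)
```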

The Data Stream
How is the data in the stream organized?
– Usually one assumes a worst-case order. In this case, Θ(1/ε²) bits are necessary and sufficient to estimate F_0. As the quality of approximation improves to, say, ε = 1%, this quadratic dependence becomes a major drawback.
– Sometimes a random ordering can be assumed. Suppose we are mining salaries, and they are ordered alphabetically by surname. If there is no correlation between salary and surname, then the stream of salaries is in random order. The backing sample architecture assumes data is randomly ordered by design (Gibbons, Matias, Poosala). This model is referred to as the Random-Order Model (Guha, McGregor).
Unfortunately, even in this model we still need Ω(1/ε²) bits to estimate F_0 (Chakrabarti, Cormode, McGregor). Intuitively, this is because the data values themselves are still worst-case.

Random Data Model
In an attempt to bypass the Ω(1/ε²) bound, we propose to study the case when the data comes from an underlying distribution.
– Problem 3 of Muthukrishnan's book: provide improved estimates for L_p sums, including distinct element estimation, if the input stream has statistical properties such as being Zipfian.
There is a distribution defined by probabilities p_i, 1 ≤ i ≤ m, with Σ_i p_i = 1. The next item in the stream is chosen independently of previous items, and equals i with probability p_i. We call this the Random-Data Model.
The Random-Data Model is contained in the Random-Order Model.

Random Data Model
This model for F_0 was implicitly studied before:
– by Motwani and Vassilvitskii, when the distribution is Zipfian; this distribution is useful for estimating web-graph statistics
– sampling-based algorithms used in practice impose distributional assumptions, without which they have poor performance (Charikar et al); the Generalized Inverse Gaussian Poisson (GIGP) model studies sampling-based estimators when the distribution is uniform or Zipfian
– by Guha and McGregor, for estimating the density function of an unknown distribution, which is useful in learning theory

Further Restriction
We focus on the case when each probability p_i = 1/d for an unknown value of d (so the distribution is uniform over a subset of [m]).
This captures sampling with replacement from a set of unknown cardinality.
For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data.
For another choice of d, we show that the lower bound for adversarial and randomly-ordered data also applies in this setting.
The distribution is fairly robust, in the sense that other distributions with a few heavy items, and remaining items that are approximately uniform, have the same properties as above.
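A minimal sketch of this data model (my own illustration; names and parameters are placeholders): the stream is n independent draws from a uniform distribution over a hidden d-element subset of [m].

```python
import random

# Generate a stream in the restricted Random-Data Model: n i.i.d. samples,
# uniform over a hidden d-element support inside the universe [m].
def uniform_stream(n, d, m, seed=0):
    rng = random.Random(seed)
    support = rng.sample(range(m), d)   # the algorithm never sees d or support
    return [rng.choice(support) for _ in range(n)]
```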

Our Upper Bound
A 1-pass algorithm using O(d(log 1/ε)/(nε²) + log m) bits of space in expectation, whenever 1/ε² ≤ d ≤ n. The per-item processing time is constant.
Recall the distribution is uniform over a d-element subset of [m], and we see n samples from it, so this is a typical setting of parameters.
Notice that for n even slightly larger than d, the algorithm does much better than the Ω(1/ε²) lower bound in the other data stream models.
One can show that, against every combination of known algorithms with different space/time tradeoffs for F_0 in the adversarial model, our algorithm is better either in space or in time.

Our Lower Bound
Our main technical result is that if n and d are Θ(1/ε²), then estimating F_0 even in the random data model requires Ω(1/ε²) space.
This lower bound subsumes the previous lower bounds, showing that even for a natural (random) choice of data, the problem is hard.
Our choice of distribution for showing the lower bound was used in subsequent work by Chakrabarti and Brody, where it turned out to be useful for establishing an Ω(1/ε²) lower bound for constant-pass algorithms estimating F_0.

Techniques: Upper Bound
A very simple observation:
– Since d ≤ n, each item has expected frequency n/d in the stream.
– If n/d is at least Ω(1/ε²), we can just compute n and the frequency of the first item in the stream to get a (1+ε)-approximation to d.
– Using a balls-and-bins occupancy bound of Kamath et al, a good estimate of d implies a good estimate of F_0.
If n/d is less than 1/ε², we could instead store the first O(1/ε²) distinct items (hashed appropriately for small space), treat them as a set S, and count the number of items in the remaining part of the stream that land in S.
– Correct, but unnecessarily expensive if d is much less than n.
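A sketch of the simple case's estimator (my reconstruction under the slide's assumptions, not the paper's exact pseudocode): when n/d ≥ 1/ε², the first item's empirical frequency concentrates around n/d, so n divided by that frequency estimates d.

```python
# Estimate d from the frequency of the first stream item; accurate to within
# (1 +/- eps) when n/d >= 1/eps^2, since the count concentrates around n/d.
def estimate_d_simple(stream):
    first, count, n = None, 0, 0
    for x in stream:
        n += 1
        if first is None:
            first = x
        if x == first:
            count += 1
    return n / count   # estimate of d; F0 then follows via the occupancy bound
```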

Techniques: Upper Bound
Instead, record the first item x in the stream, and find the position i of the second occurrence of x in the stream.
The second occurrence of x should appear at roughly position d in the stream, so i provides a constant-factor approximation to d.
Since n = Ω(d), position i lies in the first half of the stream with large constant probability.
Now store the first i/(nε²) distinct stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
– Good enough for a (1+ε)-approximation.
– Space is O(log n + (i log m)/(nε²)) ≈ O(log n + (d log m)/(nε²)).
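A sketch of this first step (my illustration, under the slide's uniform-data assumption): the gap until the first item repeats is geometrically distributed with mean d, so the position of the second occurrence is a constant-factor approximation to d with constant probability.

```python
# Return the position i of the second occurrence of the first stream item;
# i = Theta(d) with constant probability under the uniform data model.
def second_occurrence_position(stream):
    first = None
    for i, x in enumerate(stream, start=1):
        if first is None:
            first = x
        elif x == first:
            return i
    return None   # no repeat observed; suggests d = Omega(n)
```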

Techniques: Upper Bound
The space is O(log n + (d log m)/(nε²)), but we can do better.
For each j in [m], sample j independently with probability 1/(iε²). If j is sampled, say j survives. In expectation, the surviving distribution is uniform over d/(iε²) items.
Go back to the previous step: store the first i/(nε²) distinct surviving stream elements in the second half of the stream, treat them as a set S, and count the remaining surviving items in the stream that occur in S.
– Since only Θ(1/ε²) items survive, we can store S with only (i log 1/ε)/(nε²) bits by hashing item IDs down to a range of size, say, 1/ε⁵.
– We estimate the distribution's support size in the subsampled stream, which is roughly d/(iε²). We can get a (1+ε)-approximation to this quantity provided it is at least Ω(1/ε²), which it is with high probability. Then scale by iε² to estimate d, and thus F_0, by the previous reasoning.
– Constant update time.
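Putting the pieces together, here is an end-to-end sketch of the subsampled estimator (my reconstruction of the slides' outline, not the paper's exact algorithm: the random hash is simulated with a dictionary, constants are placeholders, and the 1/ε⁵-range hashing of IDs is omitted).

```python
import random

def estimate_d_subsampled(stream, n, eps, i, seed=0):
    rng = random.Random(seed)
    p = min(1.0, 1.0 / (i * eps * eps))         # survival probability 1/(i*eps^2)
    cache = {}                                  # stands in for a random hash j -> survives?
    def survives(j):
        if j not in cache:
            cache[j] = rng.random() < p
        return cache[j]

    S, hits, total = set(), 0, 0
    budget = max(1, round(i / (n * eps * eps))) # store i/(n*eps^2) distinct survivors
    for t, x in enumerate(stream, start=1):
        if t <= n // 2 or not survives(x):      # second half, surviving items only
            continue
        if len(S) < budget:
            S.add(x)                            # phase 1: collect the set S
        else:
            total += 1                          # phase 2: count hits in S
            if x in S:
                hits += 1
    # A phase-2 survivor lands in S with probability |S|/d', where d' is the
    # support size of the subsampled distribution, roughly d/(i*eps^2);
    # invert that fraction and rescale by i*eps^2 to estimate d.
    d_sub = len(S) * total / max(hits, 1)
    return d_sub * i * eps * eps
```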

Techniques: Lower Bound
Consider the uniform distribution μ on [d] = {1, 2, …, d}, where d and n are Θ(1/ε²).
Consider a stream of n samples from μ, where we choose n so that 1 − (1 − 1/d)^{n/2} ∈ [1/3, 2/3].
Let X be the characteristic vector of the first n/2 stream samples on the universe [d], so X_i = 1 if and only if item i occurs among these samples.
Let Y be the characteristic vector of the second n/2 stream samples on the universe [d], so Y_i = 1 if and only if item i occurs among these samples.
Let wt(X), wt(Y) be the number of ones in the vectors X, Y.
We consider a communication game: Alice is given X and wt(Y), Bob is given Y and wt(X), and they want to decide whether the Hamming distance Δ(X,Y) ≥ wt(X) + wt(Y) − 2·wt(X)·wt(Y)/d.
Alice and Bob should solve this problem with large probability, where the probability is over the choice of X and Y and their coin tosses (which can be fixed).
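The condition on n pins it to Θ(d): each specific item appears in a given half of the stream with constant probability. A quick derivation (my addition, using the standard approximation (1 − 1/d)^k ≈ e^{−k/d}):

```latex
1-\Big(1-\frac{1}{d}\Big)^{n/2} \in \Big[\tfrac{1}{3},\tfrac{2}{3}\Big]
\;\Longleftrightarrow\;
e^{-n/(2d)} \approx \Big(1-\frac{1}{d}\Big)^{n/2} \in \Big[\tfrac{1}{3},\tfrac{2}{3}\Big]
\;\Longleftrightarrow\;
n \in \big[\,2d\ln\tfrac{3}{2},\; 2d\ln 3\,\big] = \Theta(d).
```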

Techniques: Lower Bound
Strategy:
(1) show the space complexity S of the streaming algorithm is at least the one-way communication complexity CC of the game
(2) lower bound CC
Theorem: S ≥ CC
Proof: Alice runs the streaming algorithm on a random stream a_X generated from her characteristic vector X. Alice transmits the state to Bob. Bob continues the computation of the streaming algorithm on a random stream a_Y generated from his characteristic vector Y. At the end, the algorithm estimates F_0 of a stream whose elements are in the support of X or Y (or both).
Notice that the two halves of the stream are independent in the random data model, so the stream generated has the right distribution.

Techniques: Lower Bound
We show the estimate of F_0 can be used to solve the communication game.
Remember, the communication game is to decide whether Δ(X,Y) ≥ wt(X) + wt(Y) − 2·wt(X)·wt(Y)/d. Note that the quantity on the right is the expected value of Δ(X,Y).
Some lemmas:
(1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 − o(1)
(2) Consider the variable X′, distributed as X but conditioned on wt(X′) = k, and the variable Y′, distributed as Y but conditioned on wt(Y′) = r.
– X′ and Y′ are uniform over weight-k and weight-r bit strings, respectively
– Choose k, r to be integers in [d/4, 3d/4]
Then for any constant δ > 0, there is a constant α > 0 such that Pr[|Δ(X′,Y′) − E[Δ(X′,Y′)]| ≥ α√d] ≥ 1 − δ.
This follows from the standard deviation of the hypergeometric distribution.
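Why the right-hand side of the game's threshold equals E[Δ(X,Y)]: for independent X, Y of weights k = wt(X) and r = wt(Y), each coordinate differs with the same probability, and linearity of expectation gives (a short derivation added for completeness):

```latex
\mathbb{E}[\Delta(X,Y)] \;=\; \sum_{i=1}^{d} \Pr[X_i \neq Y_i]
\;=\; d\left[\frac{k}{d}\Big(1-\frac{r}{d}\Big) + \Big(1-\frac{k}{d}\Big)\frac{r}{d}\right]
\;=\; k + r - \frac{2kr}{d}.
```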

Techniques: Lower Bound
Δ(X,Y) = 2·F_0(a_X ∘ a_Y) − wt(X) − wt(Y), where a_X ∘ a_Y is the concatenated stream.
Note that Bob has wt(X) and wt(Y), so he can compute the threshold τ = wt(X) + wt(Y) − 2·wt(X)·wt(Y)/d.
If F is the output of the streaming algorithm, Bob simply computes 2F − wt(X) − wt(Y) and checks whether it is greater than τ.
F ∈ [(1−ε)F_0, (1+ε)F_0] with large constant probability, so
2F − wt(X) − wt(Y) ∈ [2F_0 − wt(X) − wt(Y) − 2εF_0, 2F_0 − wt(X) − wt(Y) + 2εF_0]
= [Δ(X,Y) − 2εF_0, Δ(X,Y) + 2εF_0]
= [Δ(X,Y) − Θ(εd), Δ(X,Y) + Θ(εd)]
= [Δ(X,Y) − Θ(√d), Δ(X,Y) + Θ(√d)],
using that F_0 ≤ d and ε = Θ(1/√d), since d = Θ(1/ε²).
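A sketch of the whole reduction as code (my illustration; the process/estimate interface is hypothetical and stands in for an arbitrary one-pass F_0 algorithm):

```python
# Alice runs the black-box streaming algorithm on a stream drawn from X,
# ships its memory state (S bits) to Bob, who resumes on a stream drawn
# from Y and applies this slide's threshold test.
def solve_game(alg, stream_from_X, stream_from_Y, wtX, wtY, d):
    for x in stream_from_X:
        alg.process(x)                    # Alice's half of the stream
    # ... Alice sends alg's state to Bob here: S bits of communication ...
    for y in stream_from_Y:
        alg.process(y)                    # Bob continues the computation
    F = alg.estimate()                    # (1 +/- eps)-approximation to F0
    tau = wtX + wtY - 2.0 * wtX * wtY / d
    return (2 * F - wtX - wtY) > tau      # Bob's answer to the game
```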

Techniques: Lower Bound
By our lemmas, and using that E[Δ(X,Y)] = τ, with large constant probability either Δ(X,Y) > τ + α√d or Δ(X,Y) < τ − α√d.
By the previous slide, Bob's value 2F − wt(X) − wt(Y) is in the range [Δ(X,Y) − Θ(√d), Δ(X,Y) + Θ(√d)], with constants arranged so that the Θ(√d) approximation error is less than α√d.
If Δ(X,Y) > τ + α√d, then 2F − wt(X) − wt(Y) > τ + α√d − Θ(√d) = τ.
If Δ(X,Y) < τ − α√d, then 2F − wt(X) − wt(Y) < τ − α√d + Θ(√d) = τ.
So Bob can use the output of the streaming algorithm to solve the communication problem, and hence S ≥ CC.

Lower Bounding CC
Communication game: Alice is given X and wt(Y), Bob is given Y and wt(X), and they want to decide whether Δ(X,Y) ≥ wt(X) + wt(Y) − 2·wt(X)·wt(Y)/d.
Here X, Y are distributed as independent characteristic vectors of streams of n/2 samples each, where 1 − (1 − 1/d)^{n/2} ∈ [1/3, 2/3].
With large probability, d/4 ≤ wt(X), wt(Y) ≤ 3d/4; indeed, Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 − o(1).
By averaging, a correct protocol is also correct with large probability for fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string of Hamming weight i, and Y a random string of Hamming weight j.

Lower Bounding CC
[Picture: a 0/1 matrix whose rows are indexed by the strings X with wt(X) = i and whose columns by the strings Y with wt(Y) = j; the (X,Y) entry is 1 if and only if Δ(X,Y) ≥ i + j − 2ij/d.]
With large probability, the message M that Alice sends has the property that many different X cause Alice to send M. Say such an M is large.
Since i, j ∈ [d/4, 3d/4], one can show that the fraction of 1s in each row is in [1/2 − o(1), 1/2 + o(1)].
We show that for any large M, the fraction of 1s in most of the columns is in, say, [1/10, 9/10]. Then Bob doesn't know what to output.
Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 − o(1), 1/2 + o(1)].

Lower Bounding CC
Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2 − o(1), 1/2 + o(1)].
But the variance could be huge: e.g., the matrix could consist of all-1s columns and all-0s columns in equal proportion, so each row is balanced yet every column is constant, in which case Bob can easily output the answer.
[Picture: a matrix in which every column is constant, half of them all 1s and half all 0s.]

Lower Bounding CC
We can show this doesn't happen by the second-moment method.
Let V be the fraction of 1s in the column indexed by a random Y. Let S be the set of x which cause Alice to send a large message M.
Let C_u = 1 if and only if Δ(u,Y) > τ, so V = (1/|S|) Σ_{u ∈ S} C_u.
Var[V] = (1/|S|²) Σ_{u,v ∈ S} (E[C_u C_v] − E[C_u]E[C_v]).
Notice that E[C_u]E[C_v] ∈ [1/4 − o(1), 1/4 + o(1)], while E[C_u C_v] = Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · Pr[Δ(v,Y) > τ] ≈ Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · 1/2.
Since S is a large set, most pairs u, v ∈ S have Hamming distance in [d/5, 4d/5].
– A technical lemma shows that Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] is then a constant strictly less than 1.
Hence E[C_u C_v] is a constant strictly less than 1/2, and since this happens for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4.

Lower Bounding CC
The fraction of 1s in a random column is just V = (1/|S|) Σ_{u ∈ S} C_u.
Let κ be a small positive constant. By Chebyshev's inequality,
Pr[|V − 1/2| > 1/2 − κ − o(1)] ≤ Var[V]/(1/2 − κ)² + o(1).
But we showed Var[V] is a constant strictly less than 1/4, so this probability is a constant strictly less than 1 for small enough κ.
Hence, for a random column Y, the fraction of 1s is in [κ, 1 − κ] with constant probability, and it follows that with some constant probability Bob outputs the wrong answer.
Hence, most messages of Alice must be small, so one can show that there must be 2^{Ω(d)} of them, and the communication is Ω(d) = Ω(1/ε²).

Conclusions
We introduced the random data model, and studied F_0-estimation under distributions uniform over a subset of d items.
For a certain range of d, one can beat the space lower bound that holds for adversarial data and randomly-ordered data.
For another choice of d, the lower bound for adversarial and randomly-ordered data also applies in this setting.
Are there other natural distributions that admit more space-efficient algorithms in this model?
Are there other useful ways of bypassing the Ω(1/ε²) lower bound?