Stochastic Streams: Sample Complexity vs. Space Complexity


Stochastic Streams: Sample Complexity vs. Space Complexity
David Woodruff, IBM Almaden
Joint work with Michael Crouch, Andrew McGregor, and Greg Valiant

Motivation
- (Well-studied) Statistics question: how many samples from a distribution are needed to estimate a property of that distribution?
- (Well-studied) Streaming question: for a given fixed stream of samples, how much space is needed to estimate a property of the distribution?
- Our work: understand the tradeoff between sample complexity and space complexity.

Model
- Example stream: …, 4, 3, 7, 3, 1, 1, 2
- The algorithm sees a stream of i.i.d. samples from a distribution.
- The algorithm has only 1 pass over the samples.
- Goal: understand the tradeoff between the number $t$ of samples needed to solve a problem and the space $s$ used by the algorithm.

Problems
- (Statistics) Given $t$ samples from a distribution $p = (p_1, \ldots, p_n)$ on $n$ items, estimate the collision probability $\sum_i p_i^2$ up to a $(1+\epsilon)$-factor.
- (Graph Problems) Given $t$ independent edges chosen with replacement from a graph $G$, decide if $G$ is connected.
- (Linear Algebra) Given $t$ independent samples from a subspace $S$ of $GF(2)^d$, determine if $S$ has dimension $d/2$ or dimension $d$.

Talk Outline
- Sample/Space Tradeoff for Collision Probability Estimation
- Sample/Space Tradeoff for Deciding Connectivity
- Sample/Space Tradeoff for Determining if a Subspace is Full Rank

Collision Probability
- Given $t$ samples from a distribution $p = (p_1, \ldots, p_n)$ on $n$ items, estimate the collision probability $\sum_i p_i^2$ up to a $(1+\epsilon)$-factor.
- The collision probability is called $F_2$.
- If $t = o(n^{.5})$, this is impossible with any amount of space.
  - Distinguish $p = (1/n, \ldots, 1/n)$ vs. $p = (2/n, \ldots, 2/n, 0, \ldots, 0)$.
- Our algorithm: for any $t = \Omega_\epsilon(n^{.5})$, can $(1+\epsilon)$-approximate $F_2$ with $t$ samples and $O_\epsilon(1 + n/t)$ space.

Collision Probability Algorithm
- Break the $t$ samples into $t/w$ contiguous groups of $w$ samples.
- [Figure: the stream 4 … 3 | 7 … 3 | 1 … 1 | … split into Group 1, Group 2, Group 3.]
- For each group of samples $a_1, \ldots, a_w$, let $X_{i,j} = 1$ if $a_i = a_j$ and $X_{i,j} = 0$ otherwise, and let $X = \frac{1}{w(w-1)} \sum_{i \neq j} X_{i,j}$ be the empirical collision probability of the group.
- Use $w \log n$ bits of space to compute the estimate $X$ for a group, and average the estimates over the $t/w$ groups.
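The grouping scheme on this slide can be sketched in Python. This is a minimal illustration under assumptions: the function names are hypothetical, each group is buffered in a list (the $w \log n$ bits of the slide), and an incomplete final group is dropped, a detail the slide does not specify.

```python
from itertools import islice


def group_collision_estimate(group):
    """Fraction of ordered pairs (i, j), i != j, with group[i] == group[j]."""
    w = len(group)
    counts = {}
    for a in group:
        counts[a] = counts.get(a, 0) + 1
    # A value seen c times contributes c * (c - 1) ordered colliding pairs.
    collisions = sum(c * (c - 1) for c in counts.values())
    return collisions / (w * (w - 1))


def estimate_f2(stream, w):
    """Average the per-group estimates over consecutive groups of w samples."""
    if w < 2:
        raise ValueError("w must be at least 2")
    it = iter(stream)
    total, groups = 0.0, 0
    while True:
        group = list(islice(it, w))
        if len(group) < w:  # assumption: drop an incomplete final group
            break
        total += group_collision_estimate(group)
        groups += 1
    if groups == 0:
        raise ValueError("need at least w samples")
    return total / groups
```

For example, `estimate_f2([1, 1, 2, 3], 2)` averages one group with a collision and one without.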

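Collision Probability Algorithm
- $E[X] = E[X_{i,j}] = \sum_k p_k^2 = F_2$
- $\mathrm{Var}[X] \le \frac{F_2}{w(w-1)} + \Theta\left(\frac{n^{.5} F_2}{w}\right)$
- Chebyshev's inequality implies a $(1+\epsilon)$-approximation to $F_2$ with error probability $O\left(\frac{n}{w t \epsilon^2} + \frac{n^{.5}}{t \epsilon^2}\right)$.
- Set $t > n^{.5}/\epsilon^2$ and $w = O\left(\frac{n}{t \epsilon^2}\right)$.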

Collision Probability Lower Bound
- Use a lower bound for random-order streams.
- Case 1: see a stream $a_1, \ldots, a_t$ of $t < n$ distinct items from a universe $U$.
- Case 2: see a stream $a_1, \ldots, a_t$ of $t - r$ distinct items together with an item $i$ which occurs $r$ times, all from the universe $U$.
- The order of the stream is random.
- [AOMP, GH] Any streaming algorithm needs $\Omega(t / r^{2.5})$ space to distinguish the cases, even with an infinite random tape (conjectured space: $\Omega(t / r^2)$).

Collision Probability Lower Bound
- Choose a random function $h: U \to [n]$.
- Given a stream $a_1, \ldots, a_t$ of items in a random order, feed the algorithm for IID streams the stream $h(a_1), \ldots, h(a_t)$.
- [Figure: the items $a_1, a_2, a_3, a_4, a_5, \ldots, a_t$ are mapped to $h(a_1), h(a_2), h(a_3), h(a_4), h(a_5), \ldots, h(a_t)$.]
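The hashing step of this reduction can be sketched as follows. This is an illustration only: `hashed_stream` is a hypothetical name, and the dictionary stands in for the random function $h$, which in the reduction would presumably live on the random tape rather than in the algorithm's working space.

```python
import random


def hashed_stream(stream, n, seed=None):
    """Yield h(a) for each item a, where h is a lazily sampled random function into range(n)."""
    rng = random.Random(seed)
    h = {}  # assumption: table standing in for the random tape, not counted as space
    for a in stream:
        if a not in h:
            h[a] = rng.randrange(n)  # fresh uniform value for a new item
        yield h[a]
```

Repeated items map to the same value, so an item occurring $r$ times in the input yields $r$ copies of one value in the output, while distinct items yield independent uniform values.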

Collision Probability Lower Bound
- If $a_1, \ldots, a_t$ are distinct, obtain IID samples from the distribution $(1/n, \ldots, 1/n)$.
- If $a_1, \ldots, a_t$ are distinct items together with an item $i$ occurring $r$ times, roughly see IID samples from the distribution $\left(\frac{1}{n} - \frac{r}{nt}, \ldots, \frac{1}{n} - \frac{r}{nt}, \frac{r}{t}, \frac{1}{n} - \frac{r}{nt}, \ldots, \frac{1}{n} - \frac{r}{nt}\right)$.
- If $r > t / n^{.5}$, then $F_2$ in the two cases differs by a constant factor.
- Implies $w = \Omega\left(\frac{n^{5/4}}{t^{1.5}}\right)$ (conjectured: $\Omega\left(\frac{n}{t}\right)$).
- Question: extend to $F_k$.
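As a worked check, substituting the threshold $r = t / n^{.5}$ into the random-order bound $\Omega(t / r^{2.5})$ and into the conjectured bound $\Omega(t / r^2)$ reproduces the two space bounds stated here:

```latex
\[
\left.\frac{t}{r^{2.5}}\right|_{r = t / n^{.5}}
  = \frac{t \, n^{1.25}}{t^{2.5}}
  = \frac{n^{5/4}}{t^{1.5}},
\qquad
\left.\frac{t}{r^{2}}\right|_{r = t / n^{.5}}
  = \frac{t \, n}{t^{2}}
  = \frac{n}{t}.
\]
```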

Talk Outline
- Sample/Space Tradeoff for Collision Probability Estimation
- Sample/Space Tradeoff for Deciding Connectivity
- Sample/Space Tradeoff for Determining if a Subspace is Full Rank

Graph Connectivity
- Given $t$ independent edges chosen with replacement from a graph $G$, decide if $G$ is connected.
- Simulate a random walk starting at node 1.
  - Store the current vertex.
  - If you see an edge not incident to the current vertex, discard it.
  - Remember the first node $i$ which you haven't seen. Finish when $i > n$.

Graph Connectivity
- [Figure: a graph on vertices 1, 2, 3, 4, annotated with the current vertex and the first untouched vertex; the walk starts at vertex 1 and ends with "done".]
- See IID stream: {1, 4}, {2, 3}, {1, 4}, {3, 4}, {1, 2}, {2, 3}, {1, 2}, {3, 4}
- Three of these samples are labeled "do nothing".
- $O(\log n)$ space, and $O(m \cdot (m n^2)) = O(m^2 n^2)$ samples.
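The walk can be traced on the example stream with a short Python sketch. It rests on one reading of "remember the first node you haven't seen": the counter advances only when the walk stands on exactly that node, so only two vertex indices are stored. The function name and return convention are assumptions.

```python
def walk_connectivity(n, edge_stream):
    """Return the number of samples consumed when the walk finishes, or None if it never does."""
    current = 1
    first_untouched = 2  # vertex 1 is touched at the start
    if first_untouched > n:
        return 0
    for used, (u, v) in enumerate(edge_stream, start=1):
        if current == u:
            current = v
        elif current == v:
            current = u
        else:
            continue  # edge not incident to the current vertex: do nothing
        if current == first_untouched:
            first_untouched += 1
            if first_untouched > n:
                return used
    return None


stream = [(1, 4), (2, 3), (1, 4), (3, 4), (1, 2), (2, 3), (1, 2), (3, 4)]
print(walk_connectivity(4, stream))
```

On this stream the samples {2, 3}, {3, 4}, and the second {1, 2} arrive while the walk is elsewhere and are discarded, which matches the three "do nothing" labels.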

The Loopy Graph
- For each vertex $v$ in $G$, add $m - d_v$ self-loops to vertex $v$.
- The resulting "loopy graph" $H$ is $m$-regular and has $mn$ edges.
- Perform the previous algorithm on $H$.
  - Each sample is important!
- $O(\log n)$ space, but only $O(mn \cdot n^2) = O(m n^3)$ samples.
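A small sketch of the self-loop counts, assuming a simple graph given as an edge list on vertices 1..n with no self-loops of its own (the function name is hypothetical):

```python
from collections import Counter


def loopy_self_loops(n, edges):
    """Return {v: number of self-loops to add} so that every vertex has degree m."""
    m = len(edges)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each added self-loop is counted once toward the degree of its vertex.
    return {v: m - degree[v] for v in range(1, n + 1)}
```

In the IID model, $H$ does not appear to need materializing: a uniformly sampled edge of $G$ is incident to the current vertex $v$ with probability $d_v / m$, and otherwise can be treated as one of the $m - d_v$ self-loops at $v$, so every sample corresponds to one step of the walk on $H$.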

Use More Space and Fewer Samples
- We have an algorithm with $O(\log n)$ space and $O(m n^3)$ samples.
- How can we use more space but fewer samples?
- Take $p$ random walks in the loopy graph, and remember the current vertex of each one.
  - $O(p \log n)$ space
- Use each sample to update all $p$ random walks.
  - Issue: the random walks are not independent.
  - Fix: can simulate independence, since most walks don't move on a given sample.
- What to do with $p$ independent random walks?
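The shared update can be sketched as below. This shows only the naive step in which one sample drives all $p$ walks; it does not implement the independence fix, whose details the slide leaves out. `step_all_walks` is a hypothetical name.

```python
def step_all_walks(positions, edge):
    """Advance every walk whose current vertex is an endpoint of the sampled edge."""
    u, v = edge
    moved = []
    for x in positions:
        if x == u:
            moved.append(v)
        elif x == v:
            moved.append(u)
        else:
            moved.append(x)  # acts as a self-loop step in the loopy graph
    return moved
```

Storing `positions` takes $p$ vertex indices, which is the $O(p \log n)$ space of the slide.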

Space/Time Tradeoff for Connectivity
- [Feige] For any $k$, the following is a correct algorithm whp:
  - (Phase 1) Ensure the graph has no connected components of size $\le k$.
  - (Phase 2) Sample $\frac{n \log n}{k}$ vertices and verify that the samples are connected.
- If there is a component with $\le k$ vertices, declare that the graph is disconnected.
- Otherwise, suppose we are in Phase 2: we will sample a vertex from each group of $k$ vertices.
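One standard way to see why about $(n \log n)/k$ uniform vertex samples suffice in Phase 2 is the following sketch. The constant $c$ and the use of the natural logarithm are assumptions, not stated on the slide.

```latex
\[
\Pr[\text{a fixed component with more than } k \text{ vertices is never sampled}]
  \le \left(1 - \frac{k}{n}\right)^{c \, \frac{n \ln n}{k}}
  \le e^{-c \ln n}
  = n^{-c}.
\]
```

After Phase 1 every component has more than $k$ vertices, so there are fewer than $n/k$ components, and a union bound puts the probability of missing any of them below $n^{1-c}/k$.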

Implementation in the IID Model
- For any $p \le n$, can implement the algorithm with $\tilde{O}(p)$ space and $\tilde{O}(m n^2 / p^2)$ samples.
  - Set $k = O(n/p)$.
  - Even for $p = O(1)$, this improves our earlier $O(\log n)$ space and $O(m n^3)$ samples.
- Phase 1: sample $\tilde{O}(p)$ nodes at random and run a random walk on each of them.
  - Estimate in $\tilde{O}(1)$ space the number of distinct nodes visited in each walk.
- Phase 2: sample $\tilde{O}(p)$ nodes at random and run a random walk on each of them.
  - Keep track of which sampled nodes are connected.

Talk Outline
- Sample/Space Tradeoff for Collision Probability Estimation
- Sample/Space Tradeoff for Deciding Connectivity
- Sample/Space Tradeoff for Determining if a Subspace is Full Rank

Determining if a Subspace Has Full Rank
- Given $t$ IID samples from a subspace $S$ of $GF(2)^d$, is $\mathrm{rank}(S) = d/2$ or $\mathrm{rank}(S) = d$?
- $O(\log d)$ space algorithm: choose a random standard unit vector $e_i$ and check if any sample equals $e_i$.
  - After $O(2^d)$ samples, if $S$ has full rank, will see $e_i$.
  - If $S$ has rank $d/2$, with probability ½, will never see $e_i$.
- Our lower bound: any algorithm succeeding with probability $> 2/3$ and using $o(d)$ space must use $2^{\Omega(d)}$ samples.
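The $O(\log d)$ space check can be sketched in Python, with vectors of $GF(2)^d$ encoded as $d$-bit integers. The encoding, the function names, and the helper that samples from a span are all assumptions made for illustration.

```python
import random


def sees_unit_vector(samples, i):
    """Return True if any sample (a d-bit integer) equals the standard unit vector e_i."""
    target = 1 << i  # e_i as a bitmask; only the index i needs to be stored
    return any(x == target for x in samples)


def sample_from_span(basis, rng):
    """Draw a uniform element of the GF(2) span of linearly independent basis vectors."""
    x = 0
    for b in basis:
        if rng.random() < 0.5:
            x ^= b  # include this basis vector with probability 1/2
    return x
```

For instance, with $d = 4$ and $S$ spanned by $e_0$ and $e_1$, `sees_unit_vector` can never return True for $i = 3$, whereas for the full space every $e_i$ eventually appears.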

Statistical Query Framework
- A "statistical query" algorithm (s.q. algorithm) adaptively proposes functions $f_1, f_2, \ldots$ with $f_i: \{0,1\}^d \to \{-1, 1\}$ and receives estimates of $E_x[f_i(x)]$ that are corrupted by adversarial noise.
- An $s$-query s.q. algorithm with tolerance $\tau$ is an algorithm such that:
  - for every rank-$d/2$ subspace $S$, after querying $f_1, \ldots, f_s$ with responses $r_1, \ldots, r_s$ satisfying $|r_i - E_{x \text{ uniform in } S}[f_i(x)]| \le \tau$, it outputs "rank d/2" with probability 3/4;
  - if the responses $r_1, \ldots, r_s$ satisfy $|r_i - E_{x \text{ uniform in } \{0,1\}^d}[f_i(x)]| \le \tau$, the algorithm outputs "full rank" with probability 3/4.
- Main Lemma: for any $f_i$, $\Pr_S\left[\left|E_{x \text{ uniform on } \{0,1\}^d}[f_i(x)] - E_{x \text{ uniform on } S}[f_i(x)]\right| > 2^{-d/8}\right] \le 2^{-d/4}$.
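For very small $d$, the quantity inside the Main Lemma can be computed by brute force. The sketch below evaluates the gap for one fixed subspace, whereas the lemma is a statement about a random $S$. The bitmask encoding and all function names are assumptions.

```python
def span(basis):
    """All elements of the GF(2) span of the given bitmask vectors."""
    elems = {0}
    for b in basis:
        elems |= {x ^ b for x in elems}
    return elems


def expectation_gap(f, d, basis):
    """|E over {0,1}^d of f(x) minus E over S of f(x)| for S = span(basis), by enumeration."""
    s = span(basis)
    mean_cube = sum(f(x) for x in range(1 << d)) / (1 << d)
    mean_s = sum(f(x) for x in s) / len(s)
    return abs(mean_cube - mean_s)


def parity(a):
    """The +/-1 valued function x -> (-1)^<a, x> over GF(2)."""
    return lambda x: -1 if bin(a & x).count("1") % 2 else 1
```

With $d = 4$ and $S$ spanned by $e_0$ and $e_1$, the parity on $e_2$ is constant on $S$ and so has gap 1, while the parity on $e_0$ is balanced on $S$ and has gap 0. A query separates the two cases only when it happens to align with $S$ in this way.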

Statistical Query Framework
- Each of $t$ players in a communication game receives a sample from $GF(2)^d$.
- [Figure: players $P_1, P_2, P_3, \ldots, P_t$ in sequence.]
- Each message depends on all previous messages and is at most $s$ bits.
- [SVG] If there is a communication protocol with success probability $1 - \delta$, then there is an $O(st)$-query s.q. algorithm with success probability $1 - 2\delta$ and tolerance $\frac{\delta}{t 2^s}$.
- Set $s = o(d)$ and $t = 2^{\Omega(d)}$ and apply our main lemma.

Conclusions
- Studied space versus sample tradeoffs in the data stream model.
- Obtained tradeoffs for statistical, graph, and linear algebra problems.
- Open questions: tighten our bounds.
- General question: unify the techniques for the different problems.