Stochastic Streams: Sample Complexity vs. Space Complexity


1 Stochastic Streams: Sample Complexity vs. Space Complexity
David Woodruff, IBM Almaden. Joint work with Michael Crouch, Andrew McGregor, and Greg Valiant.

2 Motivation
(Well-studied) Statistics question: how many samples from a distribution are needed to estimate a property of the distribution?
(Well-studied) Streaming question: for a given fixed stream of samples, how much space is needed to estimate a property of the distribution?
Our work: understand the tradeoff between the sample complexity and the space complexity.

3 Model
Example stream: 4 3 7 3 1 1 2
The algorithm sees a stream of i.i.d. samples from a distribution.
The algorithm makes only 1 pass over the samples.
Goal: understand the tradeoff between the number $t$ of samples needed to solve a problem and the space $s$ used by the algorithm.

4 Problems
(Statistics) Given $t$ samples from a distribution $p = (p_1, \ldots, p_n)$ on $n$ items, estimate the collision probability $\sum_i p_i^2$ up to a $(1+\epsilon)$-factor.
(Graph Problems) Given $t$ independent edges chosen with replacement from a graph $G$, decide if $G$ is connected.
(Linear Algebra) Given $t$ independent samples from a subspace $S$ of $GF(2)^d$, determine if $S$ has dimension $d/2$ or dimension $d$.

5 Talk Outline
Sample/Space Tradeoff for Collision Probability Estimation
Sample/Space Tradeoff for Deciding Connectivity
Sample/Space Tradeoff for Determining if a Subspace is Full Rank

6 Collision Probability
Given $t$ samples from a distribution $p = (p_1, \ldots, p_n)$ on $n$ items, estimate the collision probability $\sum_i p_i^2$ up to a $(1+\epsilon)$-factor.
The collision probability is called $F_2$.
If $t = o(n^{0.5})$, this is impossible with any amount of space.
Distinguish $p = (1/n, \ldots, 1/n)$ vs. $p = (2/n, \ldots, 2/n, 0, \ldots, 0)$.
Our algorithm: for any $t = \Omega_\epsilon(n^{0.5})$, can $(1+\epsilon)$-approximate $F_2$ with $t$ samples and $O_\epsilon(1 + n/t)$ space.

7 Collision Probability Algorithm
Break the $t$ samples into $t/w$ contiguous groups of $w$ samples.
Figure: the stream 4 3 7 3 1 1 split into Group 1, Group 2, Group 3.
For each group of samples $a_1, \ldots, a_w$, let $X_{i,j} = 1$ if $a_i = a_j$, and let $X = \frac{1}{w(w-1)} \sum_{i \ne j} X_{i,j}$ be the probability of a collision on the group.
Use $w \log n$ bits of space to compute an estimate $X$ for a group, and average the estimates over the $t/w$ groups.
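The grouping step above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the group size `w` is assumed to be at least 2, and each group is buffered as a plain list rather than packed into $w \log n$ bits.

```python
from collections import Counter


def group_collision_estimate(group):
    """Fraction of ordered pairs (i, j), i != j, with group[i] == group[j]."""
    w = len(group)
    counts = Counter(group)
    # An item seen c times contributes c * (c - 1) ordered colliding pairs.
    colliding_pairs = sum(c * (c - 1) for c in counts.values())
    return colliding_pairs / (w * (w - 1))


def estimate_f2(stream, w):
    """Average the per-group estimate over contiguous groups of w samples."""
    total = 0.0
    groups = 0
    buffer = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == w:
            total += group_collision_estimate(buffer)
            groups += 1
            buffer = []  # only one group is held in memory at a time
    if groups == 0:
        raise ValueError("need at least w samples")
    return total / groups
```

For example, `estimate_f2([4, 3, 7, 3, 1, 1], 2)` averages the three group estimates 0, 0, and 1. A trailing partial group, if any, is ignored in this sketch.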

8 Collision Probability Algorithm
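$E[X] = E[X_{i,j}] = \sum_k p_k^2 = F_2$
$\mathrm{Var}[X] \le \frac{F_2}{w(w-1)} + \Theta\left(\frac{n^{0.5} F_2^2}{w}\right)$
Chebyshev's inequality implies a $(1+\epsilon)$-approximation to $F_2$ with error probability $O\left(\frac{n}{w t \epsilon^2} + \frac{n^{0.5}}{t \epsilon^2}\right)$.
Set $t > n^{0.5}/\epsilon^2$ and $w = O\left(\frac{n}{t \epsilon^2}\right)$.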
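The Chebyshev step can be spelled out as below. This is a sketch under stated assumptions: the variance bound is read as $\mathrm{Var}[X] \le F_2/(w(w-1)) + \Theta(n^{0.5} F_2^2 / w)$, $\bar{X}$ (a symbol introduced here) denotes the average of the $t/w$ independent group estimates, and the last step uses $F_2 \ge 1/n$ and $w - 1 \ge w/2$ for $w \ge 2$.

```latex
\begin{align*}
\operatorname{Var}[\bar{X}]
  &= \frac{w}{t}\,\operatorname{Var}[X]
   \le \frac{F_2}{t(w-1)} + O\!\left(\frac{n^{0.5} F_2^2}{t}\right), \\
\Pr\big[\,|\bar{X} - F_2| \ge \epsilon F_2\,\big]
  &\le \frac{\operatorname{Var}[\bar{X}]}{\epsilon^2 F_2^2}
   \le \frac{1}{t(w-1)\,\epsilon^2 F_2} + O\!\left(\frac{n^{0.5}}{t\,\epsilon^2}\right)
   = O\!\left(\frac{n}{w t \epsilon^2} + \frac{n^{0.5}}{t \epsilon^2}\right).
\end{align*}
```

With $t$ a large enough multiple of $n^{0.5}/\epsilon^2$ and $w$ a large enough multiple of $n/(t\epsilon^2)$, both terms become small constants.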

9 Collision Probability Lower Bound
Use a lower bound for random-order streams.
Case 1: see a stream $a_1, \ldots, a_t$ of $t < n$ distinct items from a universe $U$.
Case 2: see a stream $a_1, \ldots, a_t$ of $t - r$ distinct items together with an item $i$ which occurs $r$ times, all from the universe $U$.
The order of the streams is random.
[AOMP, GH] Any streaming algorithm needs Ω t r space to distinguish the cases, even with an infinite random tape (conjectured space: $\Omega(t/r^2)$).

10 Collision Probability Lower Bound
Choose a random function $h: U \to [n]$.
Given a stream $a_1, \ldots, a_t$ of items in a random order, feed the algorithm for IID streams the stream $h(a_1), \ldots, h(a_t)$.
Figure: each item $a_1, a_2, a_3, a_4, a_5, \ldots, a_t$ is mapped to $h(a_1), h(a_2), h(a_3), h(a_4), h(a_5), \ldots, h(a_t)$.
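The relabeling step of the reduction can be sketched in Python as below. This is illustrative only: `hash_stream` is a hypothetical name, and the random function is sampled lazily into a dictionary, which costs memory. The lower-bound argument instead treats $h$ as coming from a free random tape.

```python
import random


def hash_stream(stream, n, seed=None):
    """Yield h(a_1), h(a_2), ... for a lazily sampled random h: U -> [n]."""
    rng = random.Random(seed)
    h = {}  # values of h chosen so far (illustration only, not space-efficient)
    for item in stream:
        if item not in h:
            h[item] = rng.randrange(1, n + 1)  # uniform over {1, ..., n}
        yield h[item]
```

Distinct items receive independent uniform labels, while repeated occurrences of the same item always receive the same label.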

11 Collision Probability Lower Bound
If $a_1, \ldots, a_t$ are distinct, obtain IID samples from the distribution $(1/n, \ldots, 1/n)$.
If $a_1, \ldots, a_t$ are distinct together with an item $i$ occurring $r$ times, roughly see IID samples from the distribution $\left(\frac{1}{n} - \frac{r}{nt}, \ldots, \frac{1}{n} - \frac{r}{nt}, \frac{r}{t}, \frac{1}{n} - \frac{r}{nt}, \ldots, \frac{1}{n} - \frac{r}{nt}\right)$.
If $r > t/n^{0.5}$, then $F_2$ in the two cases differs by a constant factor.
Implies $w = \Omega(n/t^{1.5})$ (conjectured: $w = \Omega(n/t)$).
Question: extend to $F_k$.
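The constant-factor gap can be checked with a short calculation. This is a sketch that takes the two distributions on the slide at face value; the superscripts $(1)$ and $(2)$ are labels introduced here for the two cases.

```latex
\begin{align*}
F_2^{(1)} &= \sum_{k=1}^{n} \frac{1}{n^2} = \frac{1}{n}, \\
F_2^{(2)} &= \frac{r^2}{t^2} + (n-1)\left(\frac{1}{n} - \frac{r}{nt}\right)^2
           = \frac{r^2}{t^2} + \frac{n-1}{n^2}\left(1 - \frac{r}{t}\right)^2 .
\end{align*}
```

If $r > t/n^{0.5}$, then $r^2/t^2 > 1/n$, so $F_2^{(2)} > \frac{1}{n}\left[1 + \left(1 - \frac{1}{n}\right)\left(1 - \frac{r}{t}\right)^2\right]$, which exceeds $F_2^{(1)}$ by a constant factor whenever $r/t$ is bounded away from 1.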

12 Talk Outline
Sample/Space Tradeoff for Collision Probability Estimation
Sample/Space Tradeoff for Deciding Connectivity
Sample/Space Tradeoff for Determining if a Subspace is Full Rank

13 Graph Connectivity
Given $t$ independent edges chosen with replacement from a graph $G$, decide if $G$ is connected.
Simulate a random walk starting at node 1:
  Store the current vertex.
  If you see an edge not incident to the current vertex, discard it.
  Remember the first node $i$ which you haven't seen. Finish when $i > n$.

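14 Graph Connectivity
Figure: example walk on a small graph with vertices 1 to 4, tracking the current vertex and the first untouched vertex.
Start at vertex 1.
See IID stream: {1, 4}, {2, 3}, {1, 4}, {3, 4}, {1, 2}, {2, 3}, {1, 2}, {3, 4}
For samples not incident to the current vertex: do nothing. Done once no untouched vertex remains.
$O(\log n)$ space, and $O(m \cdot (m n^2)) = O(m^2 n^2)$ samples.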
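A minimal Python sketch of this walk follows. It assumes one reading of the slide: the "first untouched vertex" counter advances only when the walk is standing on that vertex, which is what keeps the state to two vertex labels. The function name and the return convention (`False` when the stream runs out before all vertices are touched) are illustrative assumptions.

```python
def connectivity_walk(edge_stream, n):
    """Return True if the walk touches vertices 1..n (in order) before the stream ends."""
    current = 1  # current vertex of the walk
    target = 2   # first untouched vertex; vertex 1 is touched at the start
    for u, v in edge_stream:
        if target > n:
            break
        if u == current:
            current = v
        elif v == current:
            current = u
        else:
            continue  # edge not incident to the current vertex: do nothing
        if current == target:
            target += 1
    return target > n
```

On the example stream above with `n = 4`, the walk goes 1, 4, 1, 2, 3, 4, skips three samples, and finishes on the last one.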

15 The Loopy Graph
For each vertex $v$ in $G$, add $m - d_v$ self-loops to vertex $v$.
The resulting "loopy graph" $H$ is $m$-regular and has $mn$ edges.
Perform the previous algorithm on $H$.
Each sample is important!
$O(\log n)$ space, but only $O(mn \cdot n^2) = O(m n^3)$ samples.

16 Use More Space and Fewer Samples
We have an algorithm with $O(\log n)$ space and $O(m n^3)$ samples.
How can we use more space but fewer samples?
Take $p$ random walks in the loopy graph, and remember the current vertex of each one: $O(p \log n)$ space.
Use each sample to update all $p$ random walks.
Issue: the random walks are not independent.
Fix: can simulate independence, since most walks don't move on a given sample.
What to do with $p$ independent random walks?
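The shared update can be sketched in Python as below. The sketch assumes that a sampled edge not incident to a walk's current vertex counts as a self-loop step in the loopy graph, so every sample advances every walk. `step_walks` is a hypothetical name, and the independence fix mentioned above is not implemented because the slide does not spell it out.

```python
def step_walks(walks, edge):
    """Advance every walk by one sampled edge of G; returns the new positions."""
    u, v = edge
    moved = []
    for current in walks:
        if current == u:
            moved.append(v)
        elif current == v:
            moved.append(u)
        else:
            moved.append(current)  # treated as a self-loop step in the loopy graph
    return moved
```

Two walks standing on the same vertex always move together under this update, which is exactly the dependence the slide flags.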

17 Space/Time Tradeoff for Connectivity [Feige]
For any $k$, the following is a correct algorithm whp:
(Phase 1) Ensure the graph has no connected components of size $\le k$.
(Phase 2) Sample $(n \log n)/k$ vertices and verify that the samples are connected.
If there is a component with $\le k$ vertices, declare that the graph is disconnected.
Otherwise, suppose we are in Phase 2: we will sample a vertex from each group of $k$ vertices.
Figure: sampled vertices, marked x, among the groups of $k$ vertices.

18 Implementation in the IID Model
For any $p \le n$, can implement the algorithm with $\tilde{O}(p)$ space and $\tilde{O}(m n^2 / p^2)$ samples.
Set $k = O(n/p)$.
Even for $p = O(1)$, this improves our earlier $O(\log n)$ space and $O(m n^3)$ samples.
Phase 1: sample $\tilde{O}(p)$ nodes at random and run a random walk on each of them; estimate in $\tilde{O}(1)$ space the number of distinct nodes visited in each walk.
Phase 2: sample $\tilde{O}(p)$ nodes at random and run a random walk on each of them; keep track of which sampled nodes are connected.

19 Talk Outline
Sample/Space Tradeoff for Collision Probability Estimation
Sample/Space Tradeoff for Deciding Connectivity
Sample/Space Tradeoff for Determining if a Subspace is Full Rank

20 Determining if a Subspace Has Full Rank
Given $t$ IID samples from a subspace $S$ of $GF(2)^d$, is $\mathrm{rank}(S) = d/2$ or $\mathrm{rank}(S) = d$?
$O(\log d)$ space algorithm: choose a random standard unit vector $e_i$ and check if any sample equals $e_i$.
After $O(2^d)$ samples, if $S$ has full rank, will see $e_i$.
If $S$ has rank $d/2$, with probability 1/2, will never see $e_i$.
Our lower bound: any algorithm succeeding with probability $> 2/3$ and using $o(d)$ space must use $2^{\Omega(d)}$ samples.
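The unit-vector test can be sketched in Python as below. Vectors in $GF(2)^d$ are represented as `d`-bit integers, which is a representation chosen for this sketch. Both function names are hypothetical, and the sampler exists only to produce test streams; it is not part of the algorithm.

```python
import random


def sample_from_span(basis, rng):
    """Uniform sample from the GF(2) span of `basis` (vectors as bit-mask ints)."""
    x = 0
    for b in basis:
        if rng.getrandbits(1):
            x ^= b  # addition over GF(2) is XOR
    return x


def saw_unit_vector(samples, i):
    """One pass over the samples: does any sample equal the unit vector e_i?"""
    target = 1 << i  # e_i as a bit mask; only the index i needs to be stored
    return any(x == target for x in samples)
```

With a full-rank $S$ each sample equals $e_i$ with probability $2^{-d}$, so on the order of $2^d$ samples suffice. If $e_i$ lies outside $S$, the test never fires.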

21 Statistical Query Framework
A "statistical query" algorithm (s.q. algorithm) adaptively proposes functions $f_1, f_2, \ldots$ with $f_i: \{0,1\}^d \to \{-1, 1\}$ and receives estimates of $E_x[f_i(x)]$ that are corrupted by adversarial noise.
An $s$-query s.q. algorithm with tolerance $\tau$ is an algorithm such that:
  for every rank-$d/2$ subspace $S$, after querying $f_1, \ldots, f_s$ with responses $r_1, \ldots, r_s$ satisfying $|r_i - E_{x \text{ uniform in } S}[f_i(x)]| \le \tau$, it outputs "rank d/2" with probability 3/4;
  if the responses $r_1, \ldots, r_s$ satisfy $|r_i - E_{x \text{ uniform in } \{0,1\}^d}[f_i(x)]| \le \tau$, the algorithm outputs "full rank" with probability 3/4.
Main Lemma: for any $f_i$, $\Pr_S\left[\left|E_{x \text{ uniform on } \{0,1\}^d}[f_i(x)] - E_{x \text{ uniform on } S}[f_i(x)]\right| > 2^{-d/8}\right] \le 2^{-d/4}$.

22 Statistical Query Framework
Each of $t$ players in a communication game receives a sample from $GF(2)^d$.
Figure: players $P_1, P_2, P_3, \ldots, P_t$ in sequence.
Each message depends on all previous messages and is at most $s$ bits.
[SVG] If there is a communication protocol with success probability $1 - \delta$, then there is an $O(st)$-query s.q. algorithm with success probability $1 - 2\delta$ and tolerance $\delta/(t 2^s)$.
Set $s = o(d)$ and $t = 2^{\Omega(d)}$ and apply our main lemma.

23 Conclusions
Studied space versus sample tradeoffs in the data stream model.
Obtained tradeoffs for statistical, graph, and linear algebra problems.
Open questions: tighten our bounds.
General question: unify the techniques for the different problems.

