Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Stream Algorithms Lower Bounds Graham Cormode

Similar presentations


Presentation on theme: "Data Stream Algorithms Lower Bounds Graham Cormode"— Presentation transcript:

1 Data Stream Algorithms Lower Bounds Graham Cormode graham@research.att.com

2 Data Stream Algorithms — Lower Bounds 2 Streaming Lower Bounds Lower bounds for data streams – Communication complexity bounds – Simple reductions – Hardness of Gap-Hamming problem – Reductions to Gap-Hamming 1 0 1 1 1 0 1 0 1 … Alice Bob

3 Data Stream Algorithms — Lower Bounds 3 This Time: Lower Bounds So far, have seen many examples of things we can do with a streaming algorithm What about things we can’t do? What’s the best we could achieve for things we can do? Will show some simple lower bounds for data streams based on communication complexity

4 Data Stream Algorithms — Lower Bounds 4 Streaming As Communication Imagine Alice processing a stream Then take the whole working memory, and send to Bob Bob continues processing the remainder of the stream 1 0 1 1 1 0 1 0 1 … Alice Bob

5 Data Stream Algorithms — Lower Bounds 5 Streaming As Communication Suppose Alice’s part of the stream corresponds to string x, and Bob’s part corresponds to string y......and that computing the function on the stream corresponds to computing f(x,y)......then if f(x,y) has communication complexity  (g(n)), then the streaming computation has a space lower bound of  (g(n)) Proof by contradiction: If there was an algorithm with better space usage, we could run it on x, then send the memory contents as a message, and hence solve the communication problem

6 Data Stream Algorithms — Lower Bounds 6 Deterministic Equality Testing Alice has string x, Bob has string y, want to test if x=y Consider a deterministic (one-round, one-way) protocol that sends a message of length m < n There are 2 m possible messages, so some strings must generate the same message: this would cause error So a deterministic message (sketch) must be  (n) bits – In contrast, we saw a randomized sketch of size O(log n) 1 0 1 1 1 0 1 0 1 … 1 0 1 1 0 0 1 0 1 …

7 Data Stream Algorithms — Lower Bounds 7 Hard Communication Problems INDEX : x is a binary string of length n y is an index in [n] Goal: output x[y] Result: (one-way) (randomized) communication complexity of INDEX is  (n) bits DISJOINTNESS : x and y are both length n binary strings Goal: Output 1 if  i: x[i]=y[i]=1, else 0 Result: (multi-round) (randomized) communication complexity of DISJOINTNESS is  (n) bits

8 Data Stream Algorithms — Lower Bounds 8 Simple Reduction to Disjointness F  : output the highest frequency in a stream Input: the two strings x and y from disjointness Stream: if x[i]=1, then put i in stream; then same for y Analysis: if F  =2, then intersection; if F   1, then disjoint. Conclusion: Giving exact answer to F  requires  (N) bits – Even approximating up to 50% error is hard – Even with randomization: DISJ bound allows randomness x: 1 0 1 1 0 1 y: 0 0 0 1 1 0 1, 3, 4, 6 4, 5

9 Data Stream Algorithms — Lower Bounds 9 Simple Reduction to Index F  : output the number of items in the stream Input: the strings x and index y from INDEX Stream: if x[i]=1, put i in stream; then put y in stream Analysis: if (1-  )F’ 0 (x  y)>(1+  )F’(x) then x[y]=1, else it is 0 Conclusion: Approximating F 0 for  <1/N requires  (N) bits – Implies that space to approximate must be  (1/  ) – Bound allows randomization x: 1 0 1 1 0 1 y: 5 1, 3, 4, 6 5

10 Data Stream Algorithms — Lower Bounds 10 Hardness Reduction Exercises Use reductions to DISJ or INDEX to show the hardness of: Frequent items: find all items in the stream whose frequency >  N, for some . Sliding window: given a stream of binary (0/1) values, compute the sum of the last N values – Can this be approximated instead? Min-dominance: given a stream of pairs (i,x(i)), approximate  j min (i, x(i)) x(i) Rank sum: Given a stream of (x,y) pairs and query (p,q) specified after stream, approximate |{(x,y)| x<p, y<q}|

11 Data Stream Algorithms — Lower Bounds 11 Streaming Lower Bounds Lower bounds for data streams – Communication complexity bounds – Simple reductions – Hardness of Gap-Hamming problem – Reductions to Gap-Hamming 1 0 1 1 1 0 1 0 1 … Alice Bob

12 Data Stream Algorithms — Lower Bounds 12 Gap Hamming GAP-HAMM communication problem: Alice holds x  {0,1} N, Bob holds y  {0,1} N Promise: H(x,y) is either  N/2 - p N or  N/2 + p N Which is the case? Model: one message from Alice to Bob Requires  (N) bits of one-way randomized communication [Indyk, Woodruff’03, Woodruff’04, Jayram, Kumar, Sivakumar ’07]

13 Data Stream Algorithms — Lower Bounds 13 Hardness of Gap Hamming Reduction from an instance of INDEX Map string x to u by 1  +1, 0  -1 (i.e. u[i] = 2x[i] -1 ) Assume both Alice and Bob have access to public random strings r j, where each bit of r j is iid {-1, +1} Assume w.l.o.g. that length of string n is odd (important!) Alice computes a j = sign(r j  u) Bob computes b j = sign(r j [y]) Repeat N times with different random strings, and consider the Hamming distance of a 1... a N with b 1... b N

14 Data Stream Algorithms — Lower Bounds 14 Probability of a Hamming Error Consider the pair a j = sign(r j  u), b j = sign(r j [y]) Let w =  i  y u[i] r j [i] – w is a sum of (n-1) values distributed iid uniform {-1,+1} Case 1: w  0. So |w|  2, since (n-1) is even – so sign(a j ) = sign(w), independent of x[y] – Then Pr[a j  b j ] = Pr[sign(w)  sign(r j [y]) = ½ Case 2: w = 0. So a j = sign(r j  u) = sign(w + u[y]r j [y]) = sign(u[y]r j [y]) – Then Pr[a j  b j ] = Pr[sign(u[y]r j [y]) = sign(r j [y])] – This probability is 1 is u[y]=+1, 0 if u[y]=-1 – Completely biased by the answer to INDEX

15 Data Stream Algorithms — Lower Bounds 15 Finishing the Reduction So what is Pr[w=0]? – w is sum of (n-1) iid uniform {-1,+1} values – Textbook: Pr[w=0] = c/  n, for some constant c Do some probability manipulation: – Pr[a j = b j ] = ½ + c/2  n if x[y]=1 – Pr[a j = b j ] = ½ - c/2  n if x[y]=0 Amplify this bias by making strings of length N=4n/c 2 – Apply Chernoff bound on N instances – With prob>2/3, either H(a,b)>N/2 +  N or H(a,b)<N/2 -  N If we could solve GAP-HAMMING, could solve INDEX – Therefore, need  (N) =  (n) bits for GAP-HAMMING

16 Data Stream Algorithms — Lower Bounds 16 Streaming Lower Bounds Lower bounds for data streams – Communication complexity bounds – Simple reductions – Hardness of Gap-Hamming problem – Reductions to Gap-Hamming 1 0 1 1 1 0 1 0 1 … Alice Bob

17 Data Stream Algorithms — Lower Bounds 17 Lower Bound for Entropy Alice: x  {0,1} N, Bob: y  {0,1} N Entropy estimation algorithm A Alice runs A on enc(x) =  (1,x 1 ), (2,x 2 ), …, (N,x N )  Alice sends over memory contents to Bob Bob continues A on enc(y) =  (1,y 1 ), (2,y 2 ), …, (N,y N )  010011 (6,0)(5,1)(4,0)(3,0)(2,1)(1,1) Bob (6,1)(5,1)(4,0)(3,0)(2,1)(1,0) 110010 Alice

18 Data Stream Algorithms — Lower Bounds 18 Lower Bound for Entropy Observe: there are – 2H(x,y) tokens with frequency 1 each – N-H(x,y) tokens with frequency 2 each So, H(S) = log N + H(x,y)/N Thus size of Alice’s memory contents =  (N). Set  = 1/( p (N) log N) to show bound of  (  /log 1/  ) -2 ) 010011 (6,0)(5,1)(4,0)(3,0)(2,1)(1,1) Bob (6,1)(5,1)(4,0)(3,0)(2,1)(1,0) 110010 Alice

19 Data Stream Algorithms — Lower Bounds 19 Lower Bound for F 0 Same encoding works for F 0 (Distinct Elements) – 2H(x,y) tokens with frequency 1 each – N-H(x,y) tokens with frequency 2 each F 0 (S) = N + H(x,y) Either H(x,y)>N/2 +  N or H(x,y)<N/2 -  N – If we could approximate F 0 with  < 1/  N, could separate – But space bound =  (N) =  (  -2 ) bits Dependence on  for F 0 is tight

20 Data Stream Algorithms — Lower Bounds 20 Lower Bounds Exercises 1. Formally argue the space lower bound for F 2 via Gap- Hamming 2. Argue space lower bounds for F k via Gap-Hamming 3. (Research problem) Extend lower bounds for the case when the order of the stream is random or near-random 4. (Research problem) Kumar conjectures the multi-round communication complexity of Gap-Hamming is  (n) – this would give lower bounds for multi-pass streaming

21 Data Stream Algorithms — Lower Bounds 21 Streaming Lower Bounds Lower bounds for data streams – Communication complexity bounds – Simple reductions – Hardness of Gap-Hamming problem – Reductions to Gap-Hamming 1 0 1 1 1 0 1 0 1 … Alice Bob


Download ppt "Data Stream Algorithms Lower Bounds Graham Cormode"

Similar presentations


Ads by Google