# Tight Bounds for Distributed Functional Monitoring

David Woodruff (IBM Almaden), Qin Zhang (Aarhus University, MADALGO). Based on a paper in STOC, 2012.


k-party Number-In-Hand Model

- Players $P_1, P_2, \ldots, P_k$ hold inputs $x_1, x_2, \ldots, x_k$, respectively
- Goals: compute a function $f(x_1, \ldots, x_k)$ and minimize communication complexity
- Player-to-player communication
- The protocol transcript always determines who speaks next

k-party Number-In-Hand Model

- Players $P_1, P_2, \ldots, P_k$ hold inputs $x_1, x_2, \ldots, x_k$ and are connected to a coordinator $C$
- It is convenient to introduce a coordinator $C$
- All communication goes through the coordinator
- Communication is affected only by a factor of 2
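The factor of 2 can be illustrated with a minimal sketch. It assumes each message is a bit string, that a player-to-player message is relayed as one message to $C$ and one from $C$, and that addressing overhead is ignored. The names `direct_cost` and `coordinator_cost` are hypothetical and not from the talk.

```python
def direct_cost(messages):
    """Total bits when players talk to each other directly.

    Each message is a tuple (sender, receiver, bits) with bits a '0'/'1' string.
    """
    return sum(len(bits) for _, _, bits in messages)


def coordinator_cost(messages):
    """Total bits when every message is relayed through the coordinator C."""
    total = 0
    for _sender, _receiver, bits in messages:
        total += len(bits)  # sender -> C
        total += len(bits)  # C -> receiver
    return total


# A toy transcript (assumed for illustration): three player-to-player messages.
messages = [(1, 2, "0110"), (2, 3, "1"), (3, 1, "001")]
direct = direct_cost(messages)
relayed = coordinator_cost(messages)
```

In this toy transcript the relayed cost is exactly twice the direct cost, which matches the slide's claim up to the ignored addressing bits.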

Model Motivation

- Data distributed and stored in the cloud: impractical to put the data on a single device
- Sensor networks: communication is power-intensive
- Network routers: bandwidth limitations
- Distributed functional monitoring. Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, …

k-Party Number-In-Hand Model

Which functions do we care about?

- $\forall i,\ x_i \in \{0, 1, \ldots, n\}^n$
- $x = x_1 + x_2 + \cdots + x_k$
- $f(x) = |x|_p = (\sum_i x_i^p)^{1/p}$
- $|x|_0$ is the number of non-zero coordinates
- The talk will focus on $|x|_0$ and $|x|_2$

For distributed databases:

- $|x|_0$ is the number of distinct elements
- $|x|_2^2$ is known as the self-join size
- $|x|_2$ is useful for regression and low-rank approximation

It is important for applications that the $x_i$ are non-negative.
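To make the definitions concrete, here is a small sketch that sums the players' vectors and evaluates $|x|_0$ and $|x|_2$ centrally. The helper names `combine` and `norm_p` and the example vectors are assumptions for illustration. A real protocol would avoid shipping the full vectors to one place.

```python
def combine(vectors):
    """Coordinate-wise sum x = x_1 + ... + x_k of the players' vectors."""
    return [sum(coords) for coords in zip(*vectors)]


def norm_p(x, p):
    """|x|_p = (sum of x_i^p)^(1/p) for p > 0; |x|_0 counts non-zero coordinates."""
    if p == 0:
        return sum(1 for v in x if v != 0)
    return sum(v ** p for v in x) ** (1.0 / p)


# Three players, each holding a non-negative vector (toy data).
players = [
    [1, 0, 2, 0],
    [0, 0, 1, 0],
    [3, 0, 0, 0],
]
x = combine(players)   # coordinate-wise sum of the three vectors
f0 = norm_p(x, 0)      # number of non-zero coordinates of x
f2 = norm_p(x, 2)      # Euclidean norm of x
```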

Randomized Communication Complexity

What is the randomized communication cost of $f$? That is, the minimal cost of a protocol which, for every set of inputs, fails to compute $f$ with probability $< 1/3$.

- $\Omega(n)$ cost for $|x|_0$ and $|x|_2$
- Reduction from 2-Player Set-Disjointness (DISJ)
- Alice has a set $S \subseteq [n]$; Bob has a set $T \subseteq [n]$
- Either $|S \cap T| = 0$ or $|S \cap T| = 1$
- $|S \cap T| = 1 \Rightarrow \mathrm{DISJ}(S,T) = 1$, and $|S \cap T| = 0 \Rightarrow \mathrm{DISJ}(S,T) = 0$
- [KS, R] $\Omega(n)$ communication

Prohibitive.
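One standard way to see why exact computation of $|x|_0$ is as hard as DISJ is sketched below. The details are an assumption, since the slide only names the reduction: Alice and Bob use the characteristic vectors of $S$ and $T$ as inputs, and $|x|_0 = |S \cup T| = |S| + |T| - |S \cap T|$. The function names are hypothetical.

```python
def characteristic_vector(subset, n):
    """0/1 vector of length n marking the elements of subset within {1, ..., n}."""
    return [1 if j in subset else 0 for j in range(1, n + 1)]


def disj_from_f0(S, T, n):
    """Recover DISJ(S, T) from an exact value of |x|_0, under the promise
    that |S ∩ T| is 0 or 1. Uses the slide's convention DISJ = 1 iff the sets meet."""
    x = [a + b for a, b in zip(characteristic_vector(S, n),
                               characteristic_vector(T, n))]
    f0 = sum(1 for v in x if v != 0)           # |x|_0 = |S ∪ T|
    intersection_size = len(S) + len(T) - f0   # inclusion-exclusion
    return 1 if intersection_size == 1 else 0


meet = disj_from_f0({1, 3, 5}, {2, 4, 5}, 6)
apart = disj_from_f0({1, 3}, {2, 4}, 6)
```

Exchanging $|S|$ and $|T|$ costs only $O(\log n)$ extra bits, so in this sketch an exact protocol for $|x|_0$ would yield a DISJ protocol of essentially the same cost.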

Approximate Answers

Compute a relation with probability $> 2/3$:

- $f(x) \in (1 \pm \varepsilon)\,|x|_0$
- $f(x) \in (1 \pm \varepsilon)\,|x|_2$

What is the randomized communication cost as a function of $k$, $\varepsilon$, and $n$? We will ignore $\log(nk/\varepsilon)$ factors. Understanding the dependence on $\varepsilon$ is critical, e.g., $\varepsilon < 0.01$.

Previous Results

- $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$
- $|x|_2$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$

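Our Results

- $|x|_0$: the lower bound improves from $\Omega(k + \varepsilon^{-2})$ to $\Omega(k \cdot \varepsilon^{-2})$, matching the $O(k \cdot \varepsilon^{-2})$ upper bound
- $|x|_2$: the lower bound improves from $\Omega(k + \varepsilon^{-2})$ to $\Omega(k \cdot \varepsilon^{-2})$, matching the $O(k \cdot \varepsilon^{-2})$ upper bound
- First lower bounds to depend on the product of $k$ and $\varepsilon^{-2}$

Implications for data streams:

- First tight space lower bound for estimating the number of distinct elements without using the Gap-Hamming Problem
- Improves the lower bound for estimation of $|x|_p$, $p > 2$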

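Previous Lower Bounds

Lower bounds for $|x|_0$ and $|x|_2$:

- [CMY] $\Omega(k)$
- [ABC] $\Omega(\varepsilon^{-2})$
- Reduction from Gap-Orthogonality (GAP-ORT)
- $P_1, P_2$ have $u, v \in \{0,1\}^{\varepsilon^{-2}}$, respectively
- Decide whether $|\Delta(u, v) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $> 2/\varepsilon$, where $\Delta$ is the Hamming distance
- [CR, S] $\Omega(\varepsilon^{-2})$ communication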

Talk Outline

- Lower Bounds
  - $|x|_0$
  - $|x|_2$

Lower Bound for $|x|_0$

Improve the bound to the optimal $\Omega(k \cdot \varepsilon^{-2})$.

Study a simpler problem, k-GAP-THRESH:

- Each player $P_i$ holds a bit $Z_i$
- The $Z_i$ are i.i.d. Bernoulli($\beta$)
- Decide if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$; otherwise don't care

Rectangle property: for any correct protocol transcript $\tau$, the bits $Z_1, Z_2, \ldots, Z_k$ are independent conditioned on $\tau$.
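The promise problem can be written down directly. In this sketch the output encoding (1 for the high case, 0 for the low case, `None` for the don't-care region) is an assumption for illustration, as is the function name `k_gap_thresh`.

```python
import math


def k_gap_thresh(bits, beta):
    """Classify a k-GAP-THRESH input given the players' bits Z_1, ..., Z_k.

    Returns 1 if sum(Z_i) > beta*k + sqrt(beta*k),
            0 if sum(Z_i) < beta*k - sqrt(beta*k),
            None in the don't-care region between the two thresholds.
    """
    k = len(bits)
    total = sum(bits)
    center = beta * k
    width = math.sqrt(beta * k)
    if total > center + width:
        return 1
    if total < center - width:
        return 0
    return None


# Toy instance: k = 100, beta = 0.25, so the thresholds are 25 + 5 and 25 - 5.
high = k_gap_thresh([1] * 31 + [0] * 69, 0.25)
low = k_gap_thresh([1] * 19 + [0] * 81, 0.25)
middle = k_gap_thresh([1] * 25 + [0] * 75, 0.25)
```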

Rectangle Property of Communication

- Let $r$ be the randomness of $C, P_1, \ldots, P_k$
- For any fixed $r$, the set $S$ of inputs giving rise to a transcript $\tau$ is a combinatorial rectangle: $S = S_1 \times S_2 \times \cdots \times S_k$
- If the input distribution is a product distribution, then conditioned on $\tau$ and $r$, the inputs are independent
- Since this holds for every $r$, the inputs are independent conditioned on $\tau$
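The rectangle property can be checked by brute force on a toy deterministic protocol (randomness $r$ fixed). The three-player protocol below is invented purely for illustration: each message depends only on the speaker's input and on the earlier messages.

```python
from itertools import product


def transcript(x1, x2, x3):
    """A toy deterministic protocol: each message depends only on the
    speaker's own input and on the messages sent so far."""
    m1 = x1 % 2                # P1 speaks
    m2 = int(x2 >= 2) ^ m1     # P2 speaks, having seen m1
    m3 = int(x3 % 2 == m2)     # P3 speaks, having seen m2
    return (m1, m2, m3)


def is_rectangle(inputs):
    """True if the input set equals S_1 x S_2 x S_3 for its own projections S_i."""
    sides = [set(t[i] for t in inputs) for i in range(3)]
    return set(inputs) == set(product(*sides))


# Group all inputs in {0,1,2,3}^3 by the transcript they produce.
groups = {}
for x in product(range(4), repeat=3):
    groups.setdefault(transcript(*x), []).append(x)

all_rectangles = all(is_rectangle(g) for g in groups.values())
```

Every transcript class in this example is a product set, so a product input distribution stays a product distribution once the transcript is fixed.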

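k-GAP-THRESH

- Players $P_1, P_2, \ldots, P_k$ hold bits $Z_1, Z_2, \ldots, Z_k$ and are connected to the coordinator $C$
- The $Z_i$ are i.i.d. Bernoulli($\beta$)
- The coordinator wants to decide if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$
- By independence of the $Z_i \mid \tau$, this is equivalent to $C$ having noisy independent copies of the $Z_i$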

A Key Lemma

Lemma: For any protocol $\Pi$ which succeeds with probability $> 0.99$, the transcript $\tau$ is such that with probability $> 1/2$, for at least $k/2$ different $i$, $H(Z_i \mid \tau) < H(0.01\beta)$.

Proof: Suppose $\tau$ does not satisfy this.

- With large probability, $\beta k - O((\beta k)^{1/2}) < \mathbb{E}[\sum_{i=1}^{k} Z_i \mid \tau] < \beta k + O((\beta k)^{1/2})$
- Since the $Z_i$ are independent given $\tau$, $\sum_{i=1}^{k} Z_i \mid \tau$ is a sum of independent Bernoullis
- Since most $H(Z_i \mid \tau)$ are large, by anti-concentration, both events occur with constant probability: $\sum_{i=1}^{k} Z_i \mid \tau > \beta k + (\beta k)^{1/2}$ and $\sum_{i=1}^{k} Z_i \mid \tau < \beta k - (\beta k)^{1/2}$

So $\Pi$ can't succeed with large probability.
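The anti-concentration step can be checked numerically in the simplest case, where the transcript reveals nothing and each $Z_i$ is still Bernoulli($\beta$). The sketch below computes the exact distribution of a sum of independent Bernoullis by dynamic programming. The function names and the parameters $k = 400$, $\beta = 0.25$ are assumptions for illustration, and the sketch does not reproduce the entropy condition of the lemma.

```python
import math


def sum_distribution(probs):
    """Exact distribution of a sum of independent Bernoulli(p_i) variables."""
    dist = [1.0]  # the empty sum equals 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1.0 - p)   # this variable is 0
            nxt[s + 1] += mass * p       # this variable is 1
        dist = nxt
    return dist


def tail_probabilities(probs, beta):
    """Probabilities that the sum falls below / above beta*k -/+ sqrt(beta*k)."""
    k = len(probs)
    center, width = beta * k, math.sqrt(beta * k)
    dist = sum_distribution(probs)
    upper = sum(m for s, m in enumerate(dist) if s > center + width)
    lower = sum(m for s, m in enumerate(dist) if s < center - width)
    return lower, upper


k, beta = 400, 0.25
lower, upper = tail_probabilities([beta] * k, beta)
```

With these parameters both tails carry constant mass, so a coordinator that has learned nothing about the $Z_i$ cannot distinguish the two cases reliably.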

Composition Idea

The input to $P_i$ in k-GAP-THRESH, denoted $Z_i$, is the output of a 2-party Disjointness (DISJ) instance between $C$ and $P_i$.

- Let $S$ be a random set of size $1/(4\varepsilon^2)$ from $\{1, 2, \ldots, 1/\varepsilon^2\}$
- For each $i$, if $Z_i = 1$, then choose $T_i$ of size $1/(4\varepsilon^2)$ so that $\mathrm{DISJ}(S, T_i) = 1$; else choose $T_i$ so that $\mathrm{DISJ}(S, T_i) = 0$
- The distributional complexity of solving DISJ with probability $1 - \beta/100$, when $\mathrm{DISJ}(S,T) = 1$ with probability $\beta$, is $\Omega(1/\varepsilon^2)$ [R]

Putting it All Together

- Key Lemma $\Rightarrow$ for most $i$, $H(Z_i \mid \tau) < H(0.01\beta)$
- Since $H(Z_i) = H(\beta)$ for all $i$, for most $i$ the protocol $\Pi$ solves $\mathrm{DISJ}(S, T_i)$ with probability $\geq 1 - \beta/100$
- For most $i$, the communication between $C$ and $P_i$ is $\Omega(\varepsilon^{-2})$; otherwise, $C$ could simulate the other players without any communication and contradict the lower bound for $\mathrm{DISJ}(S, T_i)$
- Total communication is $\Omega(k \cdot \varepsilon^{-2})$
- Can show a reduction to estimating $|x|_0$

Reduction to $|x|_0$

- Think of $C$ as a player
- $C$'s input vector $x_C$ is the characteristic vector of the set $[1/\varepsilon^2] \setminus S$
- $P_i$'s input vector $x_i$ is the characteristic vector of the set $T_i$
- When $|T_i \cap S| = 1$, the support of $x = x_C + \sum_i x_i$ usually increases by 1
- Choose $\beta = \Theta(1/(\varepsilon^2 k))$ so that $\sum_{i=1}^{k} Z_i = \beta k \pm (\beta k)^{1/2} = 1/\varepsilon^2 \pm 1/\varepsilon$
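As a worked check of this parameter choice, take the hidden constant in $\beta = \Theta(1/(\varepsilon^2 k))$ to be 1 (an assumption made only for illustration):

```latex
\[
\beta = \frac{1}{\varepsilon^{2} k}
\;\Longrightarrow\;
\beta k = \frac{1}{\varepsilon^{2}},
\qquad
(\beta k)^{1/2} = \frac{1}{\varepsilon},
\qquad
\frac{\beta k + (\beta k)^{1/2}}{\beta k - (\beta k)^{1/2}}
  = \frac{1/\varepsilon^{2} + 1/\varepsilon}{1/\varepsilon^{2} - 1/\varepsilon}
  = \frac{1 + \varepsilon}{1 - \varepsilon}.
\]
```

Under that assumption the two threshold cases differ by a $1 \pm \Theta(\varepsilon)$ factor around $1/\varepsilon^2$, which is the scale a $(1 \pm \varepsilon)$-approximation is asked to resolve.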

Talk Outline

- Lower Bounds
  - $|x|_0$
  - $|x|_2$

Lower Bound for Euclidean Norm

Improve the $\Omega(k + \varepsilon^{-2})$ bound to the optimal $\Omega(k \cdot \varepsilon^{-2})$.

Use Gap-Orthogonality (GAP-ORT(X, Y)):

- Alice and Bob have $X, Y \in \{0,1\}^{\varepsilon^{-2}}$
- Decide whether $|\Delta(X, Y) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $> 2/\varepsilon$
- Consider the uniform distribution on $X, Y$

[KLLRX, CKW] For any protocol $\Pi$ that solves GAP-ORT with constant probability, $I(X, Y; \Pi) = H(X, Y) - H(X, Y \mid \Pi) = \Omega(1/\varepsilon^2)$.

Information Implications

- By the chain rule, $I(X, Y; \Pi) = \sum_{i=1}^{1/\varepsilon^2} I(X_i, Y_i; \Pi \mid X_{<i}, Y_{<i}) = \Omega(\varepsilon^{-2})$
- For most $i$, $I(X_i, Y_i; \Pi \mid X_{<i}, Y_{<i}) = \Omega(1)$
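The step from the sum to "for most $i$" is an averaging argument. A sketch follows, reading "most" as "a constant fraction" and writing $c$ for the constant hidden in the $\Omega$ (both are assumptions about the intended statement):

```latex
\begin{align*}
&\text{Let } n = 1/\varepsilon^{2}, \quad
  a_i = I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}), \quad
  \sum_{i=1}^{n} a_i \ge c\,n, \quad
  0 \le a_i \le H(X_i, Y_i) \le 2. \\
&\text{With } m = \bigl|\{\, i : a_i \ge c/2 \,\}\bigr|: \quad
  c\,n \;\le\; \sum_{i=1}^{n} a_i \;\le\; 2m + (n - m)\,\frac{c}{2}
  \;\le\; 2m + \frac{c\,n}{2}, \\
&\text{so } m \;\ge\; \frac{c\,n}{4}.
\end{align*}
```

Each term is at most 2 bits because $X_i$ and $Y_i$ are single bits, so at least a $c/4$ fraction of the coordinates must each carry $\Omega(1)$ information.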

XOR DISJ

We compose GAP-ORT with a variant of k-Party DISJ. Players $P_1, \ldots, P_{k/2}, P_{k/2+1}, \ldots, P_k$ hold sets $T_1, \ldots, T_{k/2}, T_{k/2+1}, \ldots, T_k \subseteq [n]$.

Choose a random $j \in [n]$ and a random $S \in \{00, 10, 01, 11\}$:

- $S = 00$: $j$ doesn't occur in any $T_i$
- $S = 10$: $j$ occurs only in $T_1, \ldots, T_{k/2}$
- $S = 01$: $j$ occurs only in $T_{k/2+1}, \ldots, T_k$
- $S = 11$: $j$ occurs in $T_1, \ldots, T_k$

Every $j' \neq j$ occurs in at most one set $T_i$. The output equals 1 if $S \in \{10, 01\}$; otherwise the output is 0.

$I(\Pi; T_1, \ldots, T_k \mid j, S, D) = \Omega(k)$ for any $\Pi$ for which $I(\Pi; S) = \Omega(1)$.
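A minimal sketch of one XOR DISJ instance, following the four cases above. The way the other elements $j' \neq j$ are spread over the players (each goes to one uniformly chosen player or to nobody) is an assumption, the auxiliary variable $D$ is not modeled, and all function names are hypothetical.

```python
import random


def xor_disj_instance(n, k, j, s, rng):
    """Build sets T_1..T_k over {0, ..., n-1} for special element j and
    pattern s in {"00", "10", "01", "11"} (k is assumed even)."""
    sets = [set() for _ in range(k)]
    for elem in range(n):
        if elem == j:
            continue
        owner = rng.randrange(k + 1)   # value k means no player holds elem
        if owner < k:
            sets[owner].add(elem)      # each j' != j lands in at most one set
    half = k // 2
    if s[0] == "1":                    # first half of the players hold j
        for i in range(half):
            sets[i].add(j)
    if s[1] == "1":                    # second half of the players hold j
        for i in range(half, k):
            sets[i].add(j)
    return sets


def pattern_of(sets, j):
    """Read the pattern S back from the sets, given the special element j."""
    half = len(sets) // 2
    first = all(j in t for t in sets[:half])
    second = all(j in t for t in sets[half:])
    return ("1" if first else "0") + ("1" if second else "0")


def xor_disj_output(s):
    """Output is 1 exactly when S is 10 or 01, i.e. the XOR of the bits of S."""
    return 1 if s in ("10", "01") else 0


rng = random.Random(0)
instances = {s: xor_disj_instance(20, 6, 7, s, rng) for s in ("00", "10", "01", "11")}
outputs = {s: xor_disj_output(pattern_of(t, 7)) for s, t in instances.items()}
```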

GAP-ORT + XOR DISJ

- Take $1/\varepsilon^2$ independent copies of XOR DISJ
- $T^i = (T^i_1, \ldots, T^i_k)$, $j^i$, $S^i$, $D^i$ are the variables for the $i$-th instance
- Is the number of outputs equal to 1 about $1/(2\varepsilon^2) \pm 1/\varepsilon$ or about $1/(2\varepsilon^2) \pm 2/\varepsilon$?

Intuitive Proof

- GAP-ORT is embedded inside of GAP-ORT + XOR DISJ
- The output is the XOR of the bits in $S$
- Implies for any correct protocol $\Pi$: for most $i$, $I(S^i; \Pi \mid S^{<i}) = \Omega(1)$
- Implies via a direct sum: for most $i$, $I(\Pi; T^i \mid j, S, D, T^{<i}) = \Omega(k)$
- Implies via the chain rule: $I(\Pi; T^1, \ldots, T^{1/\varepsilon^2} \mid j, S, D) = \Omega(k/\varepsilon^2)$
- Implies communication is $\Omega(k/\varepsilon^2)$
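The last two implications can be spelled out as follows. This sketch assumes the standard fact that a transcript of at most $|\Pi|$ bits has entropy at most $|\Pi|$, writes $n = 1/\varepsilon^2$, and reads "most $i$" as a constant fraction of the $n$ instances:

```latex
\begin{align*}
I(\Pi ; T^{1}, \ldots, T^{n} \mid j, S, D)
  &= \sum_{i=1}^{n} I(\Pi ; T^{i} \mid j, S, D, T^{<i})
  && \text{(chain rule)} \\
  &\ge \Omega(n) \cdot \Omega(k) = \Omega(k/\varepsilon^{2})
  && \text{(a constant fraction of the terms are } \Omega(k)\text{)} \\
|\Pi| \;\ge\; H(\Pi) \;\ge\; H(\Pi \mid j, S, D)
  &\ge I(\Pi ; T^{1}, \ldots, T^{n} \mid j, S, D)
  = \Omega(k/\varepsilon^{2}).
\end{align*}
```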

Conclusions

- Tight communication lower bounds for estimating $|x|_0$ and $|x|_2$
- Techniques imply tight lower bounds for empirical entropy, heavy hitters, quantiles

Other results:

- Model in which the $x_i$ undergo poly($n$) additive updates to their coordinates
- Coordinator continually maintains a $(1+\varepsilon)$-approximation
- Improve $k^2/\mathrm{poly}(\varepsilon)$ to $k/\mathrm{poly}(\varepsilon)$ communication for $|x|_2$
