Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)


1 Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)

2 The Story Begins with...

3 The Model
Alice observes A(t) by time t; Bob observes B(t) by time t
A(t), B(t): multisets
Carole tries to compute f(A(t) ∪ B(t)) for all t
All parties have infinite computing power
Goal is to minimize communication
[Figure: two item streams arriving over time t, one at Alice and one at Bob]

4 The Model
k sites
Continuous Communication Model / Distributed Streaming Model
[Figure: item streams arriving at the k sites]

5 Combination of Two Models
Communication model (the “One-shot Model”) + Streaming model = Continuous Communication Model / Distributed Streaming Model
[Figure: the communication-model and streaming-model diagrams side by side]

6 Other Models
[Gibbons and Tirthapura, 2001]
Carole tries to compute f(A ∪ B) in the end
All parties make one pass using small memory (hence small communication)
[Figure: two item streams arriving over time t]

7 Applied Motivation: Distributed Monitoring
Large-scale querying/monitoring: inherently distributed!
  - Streams physically distributed across remote sites, e.g., stream of UDP packets through routers
Challenge is “holistic” querying/monitoring
  - Queries over the union of distributed streams Q(S_1 ∪ S_2 ∪ …)
  - Streaming data is spread throughout the network
[Figure: sites S_1, …, S_6 stream bits to the query site at the Network Operations Center (NOC), which evaluates Q(S_1 ∪ S_2 ∪ …)]
Slide from the tutorial “Streaming in a connected world: Querying and tracking distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]

8 Applied Motivation: Distributed Monitoring
Traditional approach: “pull” based
  - Query all nodes once in a while
  - Expensive communication, most of it wasted
  - Inaccurate
Current trend: moving towards a “push” based approach
  - The remote sites alert the coordinator when something interesting happens
[Figure: sites S_1, …, S_6 stream bits to the query site at the Network Operations Center (NOC), which evaluates Q(S_1 ∪ S_2 ∪ …)]

9 Theoretical Questions Upper bounds: Worst-case communication bounds for a given f ? Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model?

10 The Frequency Moments
Assume integer domain [n] = {1, …, n}; item i appears m_i times in the multiset A
The p-th frequency moment: F_p = Σ_{i=1}^{n} m_i^p
F_1 is the cardinality of A
F_0 is # unique items in A (define 0^0 = 0)
F_2 is
  - Gini’s index of homogeneity in statistics
  - self-join size in db
Extensively studied since [Alon, Matias, and Szegedy, 1999]
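As a small illustration of the definition on the slide above, the sketch below computes F_p exactly for a multiset held in memory. The function name `frequency_moment` and the sample multiset are illustrative choices, not part of the talk.

```python
from collections import Counter

def frequency_moment(items, p):
    """Return F_p = sum over distinct items i of m_i ** p,
    where m_i is the number of times i appears in `items`."""
    counts = Counter(items)
    if p == 0:
        # With the convention 0^0 = 0, only items that occur contribute,
        # and each contributes exactly 1.
        return len(counts)
    return sum(m ** p for m in counts.values())

A = [1, 4, 2, 1, 3, 4, 5]          # hypothetical sample multiset
print(frequency_moment(A, 0))      # number of distinct items
print(frequency_moment(A, 1))      # cardinality of A
print(frequency_moment(A, 2))      # self-join size
```

This exact computation needs the whole multiset in one place, which is precisely what the distributed setting tries to avoid.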

11 Approximate Monitoring
Must trigger alarm when F_p > τ
Must not trigger alarm when F_p < (1 − ε)τ
Why approximate: exact monitoring is expensive and unnecessary
Why monitoring:
  - Most applications only need monitoring
  - Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away
[Figure: F_p plotted against time, with the levels (1 − ε)τ and τ marked; the alarm is raised between them]
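The reduction from tracking to monitoring mentioned on the slide above can be pictured with a short sketch. It assumes an idealized monitor that fires exactly when the value reaches its threshold; the helper names `tracking_thresholds` and `track` are hypothetical.

```python
def tracking_thresholds(eps, f_max):
    """Thresholds (1+eps)^j for j = 1, 2, ..., up to the first one >= f_max."""
    out = []
    tau = 1 + eps
    while True:
        out.append(tau)
        if tau >= f_max:
            return out
        tau *= 1 + eps

def track(value, thresholds):
    """Largest threshold whose (idealized) monitor has fired, or None.
    If value >= 1 + eps, the result is within a factor (1 + eps) of value."""
    fired = [t for t in thresholds if t <= value]
    return fired[-1] if fired else None

ths = tracking_thresholds(0.5, 10)
print(len(ths), track(6, ths))
```

Running one monitor per threshold is what produces the extra factor quoted on the slide.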

12 Prior Work
Several papers in the database literature
  - Mostly heuristic based
  - Bad worst-case bounds, no lower bounds
Prior bounds, with the bounds from this talk for comparison (Õ() suppresses polylog factors):
  F_1: O((k/ε) log(τ/k)) [SIGMOD’06]; this talk: O(k log(1/ε))
  F_0: Õ(k^2/ε^3) [ICDE’06]; this talk: Õ(k/ε^2)
  F_2: Õ(k^2/ε^4) [VLDB’05]; this talk: Õ(k^2/ε + k^{3/2}/ε^3)

13 Continuous vs One-Shot
If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X+k) bits.

14 Our Results
Good news: all continuous bounds (except F_2) are close to their one-shot counterparts
Bad news: all continuous bounds (except F_2) are close to their one-shot counterparts

15 Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 ∙ log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions

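16 Deterministic F_1 Algorithm
The first round: each site signals the coordinator whenever its local count increases by τ/(2k)
The coordinator terminates the round after receiving k signals; at that point (τ/(2k)) · k = τ/2 < F_1 < τ
[Figure: k sites, each labeled with threshold τ/(2k), around the coordinator]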

17 Deterministic F_1 Algorithm
The second round: per-site threshold τ/(4k)
[Figure: k sites, each labeled with threshold τ/(4k), around the coordinator]

18 Deterministic F_1 Algorithm
The second round: per-site threshold τ/(4k)
The coordinator terminates the round after receiving k signals; at that point 3τ/4 < F_1 < τ

19 Deterministic F_1 Algorithm
Each round communicates O(k) bits
Continue until Δ = ετ: O(log(1/ε)) rounds
After the last round, we have (1 − ε)τ < F_1 < τ
Continuous: total communication O(k log(1/ε)); lower bound Ω(k log(1/(εk)))
One-shot: O(k log(1/ε)); lower bound Ω(k log(1/(εk)))
[Figure: sites labeled Δ = ετ around the coordinator]
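The round structure on slides 16 to 19 can be simulated in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: Δ is read as the gap between τ and the coordinator's current lower bound on F_1, each arrival adds one item at one site, each signal or threshold broadcast counts as one message, and the name `monitor_f1` is hypothetical. Exact fractions are used so that halving the threshold never rounds.

```python
from fractions import Fraction

def monitor_f1(arrivals, k, tau, eps):
    """Simulate the multi-round F_1 monitor.

    arrivals: iterable of site indices in range(k), one item per entry.
    Returns (time_of_alarm, messages); time_of_alarm is None if no alarm.
    """
    slack = Fraction(tau)              # assumed meaning of Delta
    theta = slack / (2 * k)            # per-site threshold for this round
    unreported = [Fraction(0)] * k     # items a site has not yet signalled
    signals = messages = 0
    for t, site in enumerate(arrivals, start=1):
        unreported[site] += 1
        pending = [site]               # sites that must re-check theta
        while pending:
            s = pending.pop()
            while unreported[s] >= theta:
                unreported[s] -= theta
                signals += 1
                messages += 1
                if signals == k:       # round ends: k * theta more is known
                    slack /= 2
                    if slack <= eps * tau:
                        return t, messages
                    theta = slack / (2 * k)
                    signals = 0
                    messages += k      # broadcast the new threshold
                    pending = list(range(k))
    return None, messages

alarm, msgs = monitor_f1([t % 4 for t in range(2000)], k=4, tau=1024,
                         eps=Fraction(1, 8))
print(alarm, msgs)
```

In this simulation F_1 at time t equals t, so the returned alarm time can be compared directly with (1 − ε)τ and τ; the message count grows with the number of rounds, which is about log2(1/ε).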

20 Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 ∙ log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions

21 F_0: # Distinct Items
Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  - Use “sketches”: small-space streaming algorithms
  - “Combine” the sketches from the k sites
  - FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999]

22 FM Sketch
Take a pair-wise independent random hash function h: {1,…,n} → {1,…,2^d}, where 2^d > n
For each incoming element x, compute h(x)
  - e.g., h(5) = 10101100010000
  - Count how many trailing zeros
  - Remember the maximum number of trailing zeros in any h(x)
Let Y be the maximum number of trailing zeros
  - Can show E[2^Y] = # distinct elements

23 FM Sketch
So 2^Y is an unbiased estimator for # distinct elements
However, it has a large variance
  - Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
  - Space increased to Õ(1/ε^2)
FM sketch has linearity
  - Y_1 from A, Y_2 from B, then 2^{max{Y_1, Y_2}} estimates # distinct items in A ∪ B
A one-shot algorithm with communication Õ(k/ε^2)
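The two FM sketch slides can be condensed into a small class. This is a sketch under assumptions: the hash h(x) = ((a·x + b) mod P) mod 2^d with a Mersenne prime P is a common stand-in for a pair-wise independent family and is only approximately uniform on the low bits; only the basic single-copy estimator is shown, not the variance-reduced versions cited on the slide; the names `FMSketch` and `trailing_zeros` are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime used as the hash modulus (assumption)

def trailing_zeros(v, d):
    """Number of trailing zero bits of v, treating v == 0 as d zeros."""
    return d if v == 0 else (v & -v).bit_length() - 1

class FMSketch:
    def __init__(self, d, seed):
        rng = random.Random(seed)      # same seed => same hash function
        self.d = d
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        self.y = -1                    # -1 marks "no element seen yet"

    def add(self, x):
        """Insert x; return True if Y increased."""
        z = trailing_zeros(((self.a * x + self.b) % P) % (1 << self.d), self.d)
        increased = z > self.y
        self.y = max(self.y, z)
        return increased

    def merge(self, other):
        """Combine with a sketch built from the same hash function."""
        self.y = max(self.y, other.y)

    def estimate(self):
        return 0 if self.y < 0 else 2 ** self.y

site_a, site_b = FMSketch(20, seed=7), FMSketch(20, seed=7)
for x in [1, 4, 2, 1, 3]:
    site_a.add(x)
for x in [2, 3, 5, 2, 1]:
    site_b.add(x)
site_a.merge(site_b)               # sketch of the union of both streams
print(site_a.estimate())
```

Because the sketch stores only a maximum, repeated items never change it, and merging two sketches gives exactly the sketch of the union, which is the property the slide refers to as linearity.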

24 Continuously Monitoring F_0
FM sketch is monotone
  - Y_i is non-decreasing, and Y_i < log n
  - Whenever Y_i increases, notify the coordinator
  - The coordinator can always have the up-to-date combined FM sketch
  - Total communication: Õ(k/ε^2)
Lower bound: Ω(k)
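The monotonicity argument on the slide above can be checked with a short simulation. It is a sketch under assumptions: every site shares one hash function (here the stand-in ((a·x + b) mod P) mod 2^d), only a single copy of the basic sketch is kept per site rather than the Õ(1/ε^2) copies behind the stated bound, and `monitor_f0` and `make_hash` are hypothetical names.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)

def make_hash(d, seed):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % (1 << d)

def trailing_zeros(v, d):
    return d if v == 0 else (v & -v).bit_length() - 1

def monitor_f0(arrivals, k, d, seed=0):
    """arrivals: iterable of (site, item) pairs.
    Returns (coordinator's Y, number of messages sent)."""
    h = make_hash(d, seed)             # hash shared by all sites
    local_y = [-1] * k                 # Y_i at each site, -1 = empty
    coord_y = -1                       # coordinator's combined value
    messages = 0
    for site, item in arrivals:
        z = trailing_zeros(h(item), d)
        if z > local_y[site]:          # Y_i increased: notify coordinator
            local_y[site] = z
            messages += 1
            coord_y = max(coord_y, z)
    return coord_y, messages

stream = [(i % 3, (i * i) % 50 + 1) for i in range(500)]
print(monitor_f0(stream, k=3, d=16))
```

Each Y_i can only take d + 1 distinct values, so a site sends at most d + 1 notifications no matter how long its stream is.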

25 Talk Outline
Introduction
Deterministic F_1 algorithm: O(k log(1/ε))
Randomized F_1 algorithm: O(1/ε^2 ∙ log(1/δ))
Randomized F_0 algorithm: Õ(k/ε^2)
Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
Conclusions

26 F_2: The One-Shot Case
Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  - Use “sketches”: small-space streaming algorithms
  - “Combine” the sketches from the k sites
  - AMS sketch [Alon, Matias, and Szegedy, 1999]

27 AMS Sketch: “Tug-of-War”
Take a 4-wise independent random hash function h: {1,…,n} → {−1,+1}
Compute Y = ∑ h(x) over all x
Y^2 is an unbiased estimator for F_2
Use O(1/ε^2 ∙ log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
Linearity still holds!
  - One-shot case can be solved with communication Õ(k/ε^2)
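A compact version of the tug-of-war sketch from the slide above is given below. It rests on assumptions: the sign hash is the low bit of a random cubic polynomial modulo a Mersenne prime, a common stand-in that is only approximately 4-wise independent and unbiased; the copies are combined by a plain average rather than the median-of-means usually used to reach the 1 − δ guarantee; `AMSSketch` is a hypothetical name.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumption)

class AMSSketch:
    def __init__(self, copies, seed):
        rng = random.Random(seed)      # same seed => same hash functions
        # One random cubic polynomial (4 coefficients) per copy.
        self.coeffs = [[rng.randrange(P) for _ in range(4)]
                       for _ in range(copies)]
        self.y = [0] * copies          # Y = sum of h(x) over the stream

    @staticmethod
    def _sign(coeffs, x):
        v = 0
        for c in coeffs:               # Horner evaluation modulo P
            v = (v * x + c) % P
        return 1 if v & 1 else -1

    def add(self, x):
        for j, coeffs in enumerate(self.coeffs):
            self.y[j] += self._sign(coeffs, x)

    def merge(self, other):
        """Linearity: the sketch of A ∪ B is the sum of the two sketches."""
        self.y = [a + b for a, b in zip(self.y, other.y)]

    def estimate(self):
        return sum(v * v for v in self.y) / len(self.y)

site_a, site_b = AMSSketch(64, seed=3), AMSSketch(64, seed=3)
for x in [1, 4, 2, 1, 3, 4, 5]:
    site_a.add(x)
for x in [2, 3, 5, 2, 1, 2]:
    site_b.add(x)
site_a.merge(site_b)
print(site_a.estimate())
```

Unlike the FM sketch, each counter Y moves up and down as items arrive, which is why the estimate cannot simply be forwarded on every increase.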

28 However… Y is not monotone! Can’t afford to send all changes of the local sketch to the coordinator

29 F_2 Monitoring: Multi-Round Algorithm
Beginning of a round: each site sends a sketch of size Õ(1/ε^2) to the coordinator, which obtains an estimate for F_2

30 F_2 Monitoring: Multi-Round Algorithm
During a round: the coordinator holds its estimate for F_2; each site sends a signal whenever the F_2 of its updates increases by t = (τ − F_2)^2/(64k^2 τ)

31 F_2 Monitoring: Multi-Round Algorithm
End of a round: when k signals are received
old F_2 + (τ − old F_2) ∙ ε/k < new F_2 < τ
# rounds: O(k/ε)
Total cost: Õ(k^2/ε^3)
[Figure: sites around the coordinator, which holds an estimate for F_2]

32 F_2: Round / Sub-Round Algorithm
End of a sub-round: when k signals are received
  - Sites send a “rough” sketch of size Õ(1)
  - Combine sketches to maintain an upper bound of F_2
old F_2 + (τ − old F_2) ∙ ε/k < new F_2 < τ
Total cost: Õ(k^2/ε + k^{3/2}/ε^3)
One-shot: Õ(k/ε^2)
Lower bound: Ω(k)
[Figure: sites around the coordinator, which holds an estimate for F_2]

33 Open Problems
Still no clear separation between the one-shot model and the continuous model
  - F_2 is an interesting case
Many other functions f
  - Statistics: entropy, heavy hitters
  - Geometric measures: diameter, width, …
Variations of the model
  - One-way vs two-way communication
  - Does having a broadcast channel help?
  - Sliding windows?
“Continuous Communication Complexity”?

34 Thank you!

