Algorithms for Distributed Functional Monitoring
Ke Yi (HKUST)
Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)

The Story Begins with...

The Model
- Alice observes A(t) by time t; Bob observes B(t) by time t
- A(t), B(t): multisets
- Carole tries to compute f(A(t) ∪ B(t)) for all t
- All parties have infinite computing power
- Goal is to minimize communication

The Model
- k sites
- Continuous Communication Model / Distributed Streaming Model

Combination of Two Models
- Communication model ("one-shot model") + Streaming model
- ⇒ Continuous Communication Model / Distributed Streaming Model

Other Models [Gibbons and Tirthapura, 2001]
- Carole tries to compute f(A ∪ B) only in the end
- All parties make one pass using small memory ⇒ small communication

Applied Motivation: Distributed Monitoring
- Large-scale querying/monitoring: inherently distributed!
  - Streams physically distributed across remote sites, e.g., streams of UDP packets through routers
- Challenge is "holistic" querying/monitoring
  - Queries over the union of distributed streams: Q(S_1 ∪ S_2 ∪ …)
  - Streaming data is spread throughout the network
[Figure: remote sites S_1, …, S_6 stream data toward a query site at the Network Operations Center (NOC), which answers Q(S_1 ∪ S_2 ∪ …)]
(Slide from the tutorial "Streaming in a connected world: Querying and tracking distributed data streams" at VLDB'06 and SIGMOD'07 [Cormode and Garofalakis])

Applied Motivation: Distributed Monitoring
- Traditional approach: "pull"-based
  - Query all nodes once in a while
  - Expensive communication, most of it wasted
  - Inaccurate
- Current trend: moving towards a "push"-based approach
  - The remote sites alert the coordinator when something interesting happens
[Figure: the same distributed setting, with the sites pushing alerts to the query site at the NOC]

Theoretical Questions
- Upper bounds: worst-case communication bounds for a given f?
- Lower bounds: is there a gap in communication complexity between the one-shot model and the continuous model?

The Frequency Moments
- Assume integer domain [n] = {1, …, n}; item i appears m_i times
- The p-th frequency moment: F_p = Σ_i m_i^p
- F_1 is the cardinality of A (the total number of items, with multiplicity)
- F_0 is the number of distinct items in A (define 0^0 = 0)
- F_2 is
  - Gini's index of homogeneity in statistics
  - the self-join size in databases
- Extensively studied since [Alon, Matias, and Szegedy, 1999]
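For concreteness, here is a direct offline computation of the frequency moments just defined (a plain Python illustration, not one of the talk's algorithms; the names are mine):

```python
from collections import Counter

def frequency_moment(items, p):
    """F_p of a multiset: the sum of m_i^p over the items i that appear.
    The 0^0 = 0 convention holds automatically, since absent items
    contribute nothing, so F_0 is the number of distinct items."""
    return sum(m ** p for m in Counter(items).values())

a = [1, 1, 2, 3, 3, 3]
print(frequency_moment(a, 0))   # 3   distinct items (F_0)
print(frequency_moment(a, 1))   # 6   total items, the cardinality of A (F_1)
print(frequency_moment(a, 2))   # 14  = 2^2 + 1^2 + 3^2, the self-join size (F_2)
```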

Approximate Monitoring
- Must trigger an alarm when F_p > τ; cannot trigger an alarm when F_p < (1 − ε)τ
- Why approximate: exact monitoring is expensive and unnecessary
- Why monitoring
  - Most applications only need monitoring
  - Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so it is at most an O(1/ε) factor away
[Figure: F_p rising over time, with the alarm raised between (1 − ε)τ and τ]
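A worked version of the tracking-to-monitoring reduction in the bullet above (the geometric threshold schedule is from the slide; the sandwich step below is my own spelled-out arithmetic):

```latex
\tau_j = (1+\varepsilon)^j, \quad j = 0, 1, 2, \dots
\qquad\text{and, while the instance for } \tau_j \text{ has fired but the one for } \tau_{j+1} \text{ has not,}
\quad (1-\varepsilon)\,\tau_j \;\le\; F_p \;\le\; \tau_{j+1} = (1+\varepsilon)\,\tau_j .
```

So the coordinator knows F_p up to a 1 + O(ε) factor at all times by running the monitoring instances in parallel.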

Prior Work
- Several papers in the database literature
  - Mostly heuristic based
  - Bad worst-case bounds, no lower bounds
- Prior worst-case bounds vs. this work (Õ() suppresses polylog factors):
  - F_1: O((k/ε) log(τ/k)) [SIGMOD'06]  →  O(k log(1/ε))
  - F_0: Õ(k^2/ε^3) [ICDE'06]  →  Õ(k/ε^2)
  - F_2: Õ(k^2/ε^4) [VLDB'05]  →  Õ(k^2/ε + k^{3/2}/ε^3)

Continuous vs One-Shot
If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X + k) bits (each site signals end-of-stream, after which the coordinator outputs its current answer).

Our Results
- Good news: all continuous bounds (except F_2) are close to their one-shot counterparts
- Bad news: all continuous bounds (except F_2) are close to their one-shot counterparts

Talk Outline
- Introduction
- Deterministic F_1 algorithm: O(k log(1/ε))
- Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
- Randomized F_0 algorithm: Õ(k/ε^2)
- Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
- Conclusions

Deterministic F_1 Algorithm
- The first round: each site sends a signal to the coordinator for every τ/(2k) items it receives
- The coordinator terminates the round after receiving k signals
- At that point τ/(2k) · k = τ/2 items have been signalled, so τ/2 < F_1 < τ

Deterministic F_1 Algorithm
- The second round: the per-signal amount is halved to τ/(4k)
- The coordinator again terminates the round after receiving k signals, so 3τ/4 < F_1 < τ

Deterministic F_1 Algorithm
- Each round communicates O(k) bits
- Continue until Δ = ετ ⇒ O(log(1/ε)) rounds
- After the last round, we have (1 − ε)τ < F_1 < τ
- Total communication: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))
- One-shot: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))
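Below is a minimal single-process simulation of the round-based protocol on the last three slides, assuming items arrive one at a time and counting only the one-bit signals (not the O(k) bits needed to announce each new round); all names are illustrative, not from the paper.

```python
import random

def monitor_f1(stream, k, tau, eps):
    """Deterministic F_1 (count) threshold monitoring, simulated in one process.
    `stream` yields the id of the site that receives the next item.  Returns the
    number of signals sent when the alarm is raised, or None if it never is."""
    slack = tau                    # invariant: F_1 >= tau - slack
    quantum = slack / (2 * k)      # round 1: one signal per tau/(2k) items at a site
    local = [0.0] * k              # unsignalled items currently held at each site
    signals_this_round = 0
    total_signals = 0

    for site in stream:
        local[site] += 1
        if local[site] >= quantum:           # site-side trigger: send a 1-bit signal
            local[site] -= quantum
            signals_this_round += 1
            total_signals += 1
            if signals_this_round == k:      # coordinator ends the round:
                slack /= 2                   # k * quantum = slack/2 more items accounted for
                if slack <= eps * tau:       # now (1 - eps) * tau < F_1, so raise the alarm
                    return total_signals
                quantum = slack / (2 * k)    # next round: halved quantum
                signals_this_round = 0
    return None

# Example: k = 8 sites, tau = 10000, eps = 0.05  =>  O(k log(1/eps)) signals in total.
sites = (random.randrange(8) for _ in range(20000))
print(monitor_f1(sites, k=8, tau=10000, eps=0.05))
```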

Talk Outline
- Introduction
- Deterministic F_1 algorithm: O(k log(1/ε))
- Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
- Randomized F_0 algorithm: Õ(k/ε^2)
- Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
- Conclusions

F_0: # Distinct Items
- Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
- Consider the one-shot case first
  - Use "sketches": small-space streaming algorithms
  - "Combine" the sketches from the k sites
  - FM sketch [Flajolet and Martin 1985; Alon, Matias, and Szegedy, 1999]

FM Sketch
- Take a pairwise-independent random hash function h : {1, …, n} → {1, …, 2^d}, where 2^d > n
- For each incoming element x, compute h(x) and count how many trailing zeros its binary representation has (e.g., …0110100 has 2 trailing zeros)
- Remember the maximum number of trailing zeros seen in any h(x); call it Y
- Can show E[2^Y] = # distinct elements

FM Sketch
- So 2^Y is an unbiased estimator for # distinct elements, but it has a large variance
  - More recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce an estimator that is within relative error ε with probability 1 − δ
  - Space increases to Õ(1/ε^2)
- The FM sketch has "linearity"
  - Y_1 from A, Y_2 from B: then 2^max{Y_1, Y_2} estimates # distinct items in A ∪ B
- ⇒ A one-shot algorithm with communication Õ(k/ε^2)
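A single-copy FM sketch in Python, as a sketch of the idea rather than the talk's exact construction: the pairwise-independent hash is a simple (ax + b) mod p map instead of the {1, …, 2^d} range on the slide, and a real estimator would keep Õ(1/ε^2) independent copies. The `merge` method is the "linearity" used to combine the k sites' sketches.

```python
import random

class FMSketch:
    """One copy of a Flajolet-Martin-style sketch over the domain {0, ..., n-1}."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)              # same seed at every site => same hash
        self.p = self._next_prime(2 * n)
        self.a = rng.randrange(1, self.p)      # pairwise-independent h(x) = (a*x + b) mod p
        self.b = rng.randrange(self.p)
        self.y = -1                            # Y = max trailing zeros of any h(x) seen

    @staticmethod
    def _next_prime(m):
        def is_prime(x):
            return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))
        while not is_prime(m):
            m += 1
        return m

    def update(self, x):
        h = (self.a * x + self.b) % self.p
        z = (h & -h).bit_length() - 1 if h else self.p.bit_length()  # trailing zeros of h(x)
        self.y = max(self.y, z)

    def estimate(self):
        return 2 ** self.y if self.y >= 0 else 0   # E[2^Y] ~ number of distinct items

    def merge(self, other):
        """Sketch of A ∪ B: keep the larger Y (both sketches must share the same hash)."""
        assert (self.a, self.b, self.p) == (other.a, other.b, other.p)
        self.y = max(self.y, other.y)
        return self

# Two sites sketch their local streams; the coordinator merges the sketches.
s1, s2 = FMSketch(n=1000, seed=42), FMSketch(n=1000, seed=42)
for x in (1, 2, 3, 4, 5): s1.update(x)
for x in (4, 5, 6, 7):    s2.update(x)
print(s1.merge(s2).estimate())   # rough single-copy estimate of the 7 distinct items
```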

Continuously Monitoring F_0
- The FM sketch is monotone: Y_i is non-decreasing, and Y_i < log n
- Whenever Y_i increases, notify the coordinator
- The coordinator thus always has the up-to-date combined FM sketch
- Total communication: Õ(k/ε^2); lower bound: Ω(k)
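Given the FMSketch above, the push protocol on this slide is only a few lines (helper names are hypothetical); since each Y_i only moves upward through the integers below log n, every site sends O(log n) such messages per hash copy over the whole stream.

```python
def site_process(sketch, x, send_to_coordinator):
    """Run at a site for every arriving item x; pushes Y_i only when it grows."""
    old_y = sketch.y
    sketch.update(x)
    if sketch.y > old_y:                  # monotone, so this happens at most ~log n times
        send_to_coordinator(sketch.y)

def coordinator_on_message(state, new_y):
    """The coordinator's combined sketch is just the running max of reported Y_i."""
    state["y"] = max(state["y"], new_y)
    return 2 ** state["y"]                # current F_0 estimate for the union

# e.g. state = {"y": -1} before any site has reported.
```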

Talk Outline
- Introduction
- Deterministic F_1 algorithm: O(k log(1/ε))
- Randomized F_1 algorithm: O(1/ε^2 · log(1/δ))
- Randomized F_0 algorithm: Õ(k/ε^2)
- Randomized F_2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
- Conclusions

F_2: The One-Shot Case
- Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
- Consider the one-shot case first
  - Use "sketches": small-space streaming algorithms
  - "Combine" the sketches from the k sites
  - AMS sketch [Alon, Matias, and Szegedy, 1999]

AMS Sketch: "Tug-of-War"
- Take a 4-wise independent random hash function h : {1, …, n} → {−1, +1}
- Compute Y = Σ h(x) over all x
- Y^2 is an unbiased estimator for F_2
- Use O(1/ε^2 · log(1/δ)) copies to get an estimator that is within relative error ε with probability 1 − δ
- Linearity still holds!
- ⇒ The one-shot case can be solved with communication Õ(k/ε^2)
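Again as a single-copy illustration (a real estimator takes medians of means over O(1/ε^2 · log(1/δ)) copies); the 4-wise independent hash is implemented as a random degree-3 polynomial over a Mersenne prime, a standard choice but my own assumption rather than something stated on the slide.

```python
import random

class AMSSketch:
    """One tug-of-war counter Y = sum over stream items x of h(x), h(x) in {-1, +1}."""

    PRIME = (1 << 61) - 1                  # modulus for the polynomial hash

    def __init__(self, seed=0):
        rng = random.Random(seed)          # same seed at every site => same h
        self.coef = [rng.randrange(self.PRIME) for _ in range(4)]   # 4-wise independence
        self.y = 0

    def _sign(self, x):
        v = 0
        for c in self.coef:                # Horner evaluation of the degree-3 polynomial
            v = (v * x + c) % self.PRIME
        return 1 if v & 1 else -1          # fold down to {-1, +1}

    def update(self, x, count=1):
        self.y += self._sign(x) * count

    def estimate(self):
        return self.y ** 2                 # E[Y^2] = F_2

    def merge(self, other):
        """Linearity: Y(A ∪ B) = Y(A) + Y(B), so the k sites' sketches simply add up."""
        assert self.coef == other.coef
        self.y += other.y
        return self
```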

However…
- Y is not monotone!
- Can't afford to send every change of the local sketch to the coordinator

F_2 Monitoring: Multi-Round Algorithm
- Beginning of a round: each site sends a sketch of size Õ(1/ε^2) to the coordinator, which combines them into an estimate of the current F_2

F_2 Monitoring: Multi-Round Algorithm
- During a round: each site sends a signal whenever the F_2 of its updates (since the round began) increases by t = (τ − F_2)^2 / (64 k^2 τ), where F_2 is the coordinator's estimate from the beginning of the round

F_2 Monitoring: Multi-Round Algorithm
- End of a round: when k signals are received, the coordinator collects fresh sketches and computes a new estimate of F_2
- Guarantee: old F_2 + (τ − old F_2) · ε/k < new F_2 < τ
- # rounds: O(k/ε); total cost: Õ(k^2/ε^3)
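A toy single-process simulation of this round structure, in which exact local and global F_2 values stand in for the Õ(1/ε^2)-size AMS sketches of the real protocol, and the alarm is raised once a round-start estimate exceeds (1 − ε)τ (one reasonable stopping rule; the slide does not spell it out). The threshold t follows the previous slide; all names are mine.

```python
import random
from collections import Counter

def f2(counts):
    return sum(c * c for c in counts.values())

def monitor_f2(stream, k, tau, eps):
    """Multi-round F_2 monitoring; `stream` yields (site, item) pairs.
    Returns (#rounds, #signals) when the alarm is raised, or None."""
    seen = Counter()                       # updates the coordinator has collected so far
    f2_round = 0                           # coordinator's F_2 estimate at the round start
    t = (tau - f2_round) ** 2 / (64 * k * k * tau)   # per-site signal threshold
    local = [Counter() for _ in range(k)]  # each site's updates in the current round
    sent = [0] * k                         # signals already sent by each site this round
    signals_round = total_signals = rounds = 0

    for site, item in stream:
        local[site][item] += 1
        crossed = int(f2(local[site]) // t)          # multiples of t crossed so far
        if crossed > sent[site]:                     # site-side trigger
            signals_round += crossed - sent[site]
            total_signals += crossed - sent[site]
            sent[site] = crossed
        if signals_round >= k:                       # coordinator ends the round:
            rounds += 1
            for c in local:                          # collect the (sketched) updates
                seen.update(c)
                c.clear()
            sent, signals_round = [0] * k, 0
            f2_round = f2(seen)                      # new round-start estimate
            if f2_round > (1 - eps) * tau:           # simplified alarm condition
                return rounds, total_signals
            t = (tau - f2_round) ** 2 / (64 * k * k * tau)
    return None

stream = ((random.randrange(4), random.randrange(50)) for _ in range(60000))
print(monitor_f2(stream, k=4, tau=1_000_000, eps=0.1))
```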

F_2: Round / Sub-Round Algorithm
- Each round is further divided into sub-rounds; a sub-round ends when k signals are received
- At the end of a sub-round the sites send "rough" sketches of size Õ(1); the coordinator combines them to maintain an upper bound on F_2
- Total cost: Õ(k^2/ε + k^{3/2}/ε^3)
- One-shot: Õ(k/ε^2); lower bound: Ω(k)

Open Problems
- Still no clear separation between the one-shot model and the continuous model
  - F_2 is an interesting case
- Many other functions f
  - Statistics: entropy, heavy hitters
  - Geometric measures: diameter, width, …
- Variations of the model
  - One-way vs. two-way communication
  - Does having a broadcast channel help?
  - Sliding windows?
- "Continuous Communication Complexity"?

Thank you!