Distributed Set-Expression Cardinality Estimation Abhinandan Das (Cornell U.) Sumit Ganguly (I.I.T. Kanpur) Minos Garofalakis (Bell Labs.) Rajeev Rastogi (Bell Labs.)


Distributed Set-Expression Cardinality Estimation Abhinandan Das (Cornell U.) Sumit Ganguly (I.I.T. Kanpur) Minos Garofalakis (Bell Labs.) Rajeev Rastogi (Bell Labs.)

Introduction
A new class of distributed data streaming applications: remote update streams are continuously transmitted to a central system for online querying and analysis. Examples: network traffic statistics, call detail records, Web usage logs, sensor data. Network monitoring (DDoS) query: the number of distinct source IP addresses observed in flows across an ISP's border routers.

Example Applications
Network monitoring: detecting DDoS attacks. Web content delivery services such as Akamai: redirect users to the geographically closest or least loaded server. Example query: the number of users that access website A but not website B. Online mining of web click-streams: placing advertisements on pages, determining the servers at which to replicate web sites.

Set-Expression Cardinality Tracking
Estimate the number of distinct values in the result of an arbitrary set expression over distributed data streams. Operators: union, intersection, difference ($\cup$, $\cap$, $-$). This generalizes distinct count estimation for single streams. Akamai example: $|(S_A \cap S_B) - S_C|$ = number of users who visit site A and site B but not site C.

Objective
Important metric in monitoring applications: minimizing communication overhead. The naïve approach is infeasible, e.g. AT&T's backbone routers generate 500GB of data per day. Exact answers are usually not required, so trade off answer accuracy for reduced data communication costs, with provable approximation error guarantees.

Outline
Model and problem formulation; estimating single stream cardinality; estimating cardinality of arbitrary set expressions; experimental results; conclusions and related work.

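System Model
$m+1$ sites (a coordinator, site 0, and $m$ remote sites) and $n$ streams. Each substream $S_{i,j}$ is a multiset over the domain $[M]=\{0,\dots,M-1\}$. $S_i = \bigcup_{j=1..m} S_{i,j}$ $(i=1..n)$. Stream updates: insertions and deletions of elements.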

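Problem Formulation
Estimate $|E|$, where $E$ is a set expression over $S_0,\dots,S_{n-1}$, within an absolute error tolerance $\epsilon$, while minimizing communication. Example: Site 1 holds $S_{0,1}=\{a\}$ and $S_{1,1}=\{a,b\}$; Site 2 holds $S_{0,2}=\{b\}$ and $S_{1,2}=\{c\}$. Then $S_0=\{a,b\}$ and $S_1=\{a,b,c\}$, so $E=S_0 \cap S_1=\{a,b\}$ and $|E|=2$.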
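The two-site example can be checked in a few lines of Python. This is only a minimal sketch: the dictionary layout and the helper name `global_stream` are illustrative assumptions, and the multisets are treated as plain sets because only distinct values matter for the cardinality.

```python
# Substreams S_{i,j} from the example, keyed by (stream i, site j).
substreams = {
    (0, 1): {"a"}, (1, 1): {"a", "b"},
    (0, 2): {"b"}, (1, 2): {"c"},
}

def global_stream(i):
    """Return S_i, the union of stream i's substreams over all sites."""
    result = set()
    for (stream, _site), elems in substreams.items():
        if stream == i:
            result |= elems
    return result

S0, S1 = global_stream(0), global_stream(1)
E = S0 & S1  # E = S_0 intersect S_1
print(sorted(E), len(E))
```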

Estimating Single Stream Cardinality
$E=S_0$, where $S_0=\bigcup_{j=1..m} S_{0,j}$. Basic approach: distribute the error tolerance $\epsilon$ among the $m$ sites, allocating budget $\epsilon_j \ge 0$ to site $j$ such that $\sum_j \epsilon_j = \epsilon$. Possible allocation approaches: proportional to stream update rates, or uniform ($\epsilon_j = \epsilon/m$).
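The two allocation approaches could be sketched as follows. The function names are hypothetical, and the proportional variant assumes the per-site update rates are known and have a positive sum.

```python
def uniform_budgets(epsilon, m):
    """Uniform allocation: every site gets epsilon / m."""
    return [epsilon / m] * m

def proportional_budgets(epsilon, rates):
    """Allocation proportional to per-site update rates (assumed positive sum)."""
    total = sum(rates)
    return [epsilon * r / total for r in rates]

print(uniform_budgets(16, 4))
print(proportional_budgets(10, [1, 2, 7]))
```

In both cases the budgets are nonnegative and sum to the overall tolerance, which is the only constraint the slide states.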

Single Stream Approach: Overview
$S'_{i,j}$ = most recent state of substream $S_{i,j}$ communicated by site $j$ to the coordinator. For each stream $S_i$, the coordinator constructs the global state $S'_i = \bigcup_j S'_{i,j}$. The coordinator estimates the cardinality of set expression $E$ as $|E'|$. (Figure: remote sites $1,\dots,m$ hold $S_{i,1},\dots,S_{i,m}$; the coordinator, site 0, holds $S'_{i,1},\dots,S'_{i,m}$ and computes $E'=f(S'_{i,1},\dots,S'_{i,m})$.)

Error Guarantees
Need to ensure correctness: $|E|-\epsilon \le |E'| \le |E|+\epsilon$. Naïve approach for $E=S_i$: each remote site $j$ sends its current state $S_{i,j}$ to the coordinator if $|S_{i,j} - S'_{i,j}| > \epsilon_j$ or $|S'_{i,j} - S_{i,j}| > \epsilon_j$. One can show that this ensures correctness.
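The naïve rule for a single remote site might be sketched as below. The class `NaiveSite` and its fields are hypothetical names, and the message to the coordinator is only simulated by copying the state and counting.

```python
class NaiveSite:
    """Sketch of one remote site j tracking a single substream."""

    def __init__(self, budget):
        self.budget = budget    # epsilon_j
        self.current = set()    # S_{i,j}: current local state
        self.reported = set()   # S'_{i,j}: state last sent to the coordinator
        self.messages = 0       # number of (simulated) messages sent

    def _maybe_sync(self):
        added = len(self.current - self.reported)
        removed = len(self.reported - self.current)
        if added > self.budget or removed > self.budget:
            self.reported = set(self.current)
            self.messages += 1

    def insert(self, e):
        self.current.add(e)
        self._maybe_sync()

    def delete(self, e):
        self.current.discard(e)
        self._maybe_sync()

site = NaiveSite(budget=2)
for e in ["a", "b", "c"]:
    site.insert(e)
site.delete("a")
print(site.messages, sorted(site.reported))
```

With a budget of 2, the first two inserts stay local, the third triggers a message, and the later delete stays within budget.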

Naïve Charging Scheme
Intuitively, associate a charge $\phi_j(e)$ with every element $e$ at every remote site $j$. Each insert is charged 1: $\phi_j^+(e)$++. Each delete is charged 1: $\phi_j^-(e)$++. If the total charges at any site $j$ exceed $\epsilon_j$, the site communicates its state to the coordinator.

Exploiting Global Knowledge
Key idea: in many stream application domains, there exists a subset of "globally popular" elements, e.g. in IP network monitoring, destination IP addresses such as Yahoo, CNN, etc. Updates to popular elements can be charged less.

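Exploiting Global Knowledge (contd.)
(Figure: element $e$ is present at three of the remote sites, so $\theta(e)=3$. An insertion of $e$ at site 3 is charged $\phi_3^+(e)=0$; a deletion of $e$ at site 2 is charged $\phi_2^-(e)=1/3$.)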

Coordinator Actions
Maintains counts of the number of remote sites containing $e$ in $S'_{i,j}$. Frequent elements (counts reaching a threshold) are added to the set $F_i$. The coordinator computes a lower bound $\theta_i(e)$ for every $e \in F_i$, with the invariant $\theta_i(e) \le \mathrm{count}_i(e)$. Changes in $\theta_i(e)$ or $F_i$ are propagated to remote sites. To control message overhead, avoid frequent updates to $\theta_i(e)$ and $F_i$.
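One way the coordinator's bookkeeping could look is sketched below. The slides do not say how the lower bound is chosen. Rounding the count down to a power of two is purely an assumption made here so that the bound changes rarely, and the class and field names are hypothetical.

```python
from collections import Counter

class Coordinator:
    """Sketch of the coordinator's state for one stream i."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed: count at which e becomes frequent
        self.reported = {}          # site j -> S'_{i,j}
        self.theta = {}             # e -> theta_i(e), defined only for e in F_i
        self.broadcasts = 0         # number of times theta/F_i was re-sent

    def receive(self, site, state):
        """Record a new state from a remote site and refresh F_i and theta_i."""
        self.reported[site] = set(state)
        counts = Counter(e for s in self.reported.values() for e in s)
        new_theta = {}
        for e, c in counts.items():
            if c >= self.threshold:
                # Assumed rule: largest power of two <= count, so theta <= count
                # always holds and theta changes only when the count doubles.
                new_theta[e] = 1 << (c.bit_length() - 1)
        if new_theta != self.theta:
            self.theta = new_theta
            self.broadcasts += 1

coord = Coordinator(threshold=2)
for j in range(1, 5):
    coord.receive(j, {"e"})
print(coord.theta, coord.broadcasts)
```

In this run the count of `e` grows from 1 to 4, but the bound is only broadcast when it becomes 2 and again when it becomes 4.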

Remote Site Actions
Whenever an element $e$ is inserted or deleted, or $F_i$ or $\theta_i(e)$ changes: compute the new charges $\phi_j^+(e)$, $\phi_j^-(e)$ and update the total site charges $\phi_j^+$, $\phi_j^-$. If $\phi_j^+ > \epsilon_j$ or $\phi_j^- > \epsilon_j$, propagate all new changes to the coordinator and reset all charges.
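The remote-site loop could be sketched as follows for the single-stream case. It uses the charges from the deck's correctness slide: 1 per pending insert or delete of a non-frequent element, 0 for a pending insert of a frequent element, and $1/\theta_i(e)$ for a pending delete of a frequent element. As a simplifying assumption, charges are recomputed from the difference between the current and last-reported states instead of being updated incrementally, and all names are hypothetical.

```python
class ChargingSite:
    """Sketch of remote site j using global knowledge (single stream)."""

    def __init__(self, budget):
        self.budget = budget    # epsilon_j
        self.current = set()    # S_{i,j}
        self.reported = set()   # S'_{i,j}
        self.theta = {}         # theta_i(e) for frequent e, sent by the coordinator
        self.messages = 0

    def charges(self):
        """Return the total site charges (phi_j^+, phi_j^-)."""
        plus = 0.0
        minus = 0.0
        for e in self.current - self.reported:   # pending inserts
            plus += 0.0 if e in self.theta else 1.0
        for e in self.reported - self.current:   # pending deletes
            minus += 1.0 / self.theta[e] if e in self.theta else 1.0
        return plus, minus

    def update(self, e, insert):
        if insert:
            self.current.add(e)
        else:
            self.current.discard(e)
        plus, minus = self.charges()
        if plus > self.budget or minus > self.budget:
            # Send the changes; the charges reset because the states agree again.
            self.reported = set(self.current)
            self.messages += 1

site = ChargingSite(budget=1.0)
site.current = {"x", "y", "z"}
site.reported = {"x", "y", "z"}
site.theta = {"x": 4}            # x is globally popular
site.update("x", insert=False)   # charged 1/4: stays local
after_first = site.messages
site.update("y", insert=False)   # total 1.25 > 1: message sent
print(after_first, site.messages)
```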

Generalizing to Arbitrary Set Expressions
Cardinality estimation for an arbitrary expression $E$ involving $S_0,\dots,S_{n-1}$ and the set operators $\cup$, $\cap$, $-$. The generalized scheme is identical to the single stream solution except for the charging procedure.

Generalized Charging Schemes
Naïve approach: set $\phi_j(e)=1$ if $e$ is inserted into or deleted from any substream. This is too conservative and overcharges. E.g. $E = S_1 \cap (S_2 - S_3)$: suppose $e \in S'_{3,j}$ and $e \in S_{3,j}$. Then $e$ is in neither $E$ nor $E'$, so we can set $\phi_j^+(e)=\phi_j^-(e)=0$.

Model Based Charging Scheme
Overview: construct a boolean formula $\Psi_j$ that captures the semantics of expression $E$ as well as the local and global information available at each site. Use the formula to determine the scenarios that modify $|E|$.

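Constructing Boolean Formula $\Psi_j$
Boolean variables $p_i$ and $p'_i$ with semantics $e \in S_i$ and $e \in S'_i$ respectively. Operators map as $\cup \to \lor$, $\cap \to \land$, $- \to \land\lnot$. Example: $E = S_1 \cup S_2 \Rightarrow F_E = p_1 \lor p_2$ and $F'_E = p'_1 \lor p'_2$. $\Psi_j^+ = F_E \land \lnot F'_E = (p_1 \lor p_2) \land (\lnot p'_1 \land \lnot p'_2)$ specifies the conditions that must be satisfied to ensure $e \in E-E'$. Similarly, $\Psi_j^- = \lnot F_E \land F'_E$.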

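Incorporating Local Knowledge
Suppose $E = S_1 \cup S_2$. $e \in S_{1,j} \Rightarrow e \in S_1$, and hence $p_1$ must be true: $\Psi_j^+ = (F_E \land \lnot F'_E) \land p_1$. In general, $\Psi_j^+ = (F_E \land \lnot F'_E) \land G_j$, where $G_j$ is the local state formula: $e \in S_{i,j} \Rightarrow$ variable $p_i$ is added to $G_j$. E.g. $e \in S_{1,j}$ and $e \in F_2 \Rightarrow G_j = p_1 \land p'_2$. Similarly, $\Psi_j^- = (\lnot F_E \land F'_E) \land G_j$.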

Significance of $\Psi_j$
Model: an assignment of truth values to the variables in a boolean formula that satisfies the formula. Every model $M$ satisfying $\Psi_j$ represents (from the viewpoint of site $j$) a possible scenario for the states $S'_i$, $S_i$ consistent with local information.

Model Based Charging Scheme
Multiple models for $\Psi_j^+$ are possible. A charge $\phi_j(M)$ is assigned to every model $M$ satisfying $\Psi_j^+$ at site $j$, and $\phi_j^+(e)=\max\{\phi_j(M) : M \text{ satisfies } \Psi_j^+\}$. Determining $\phi_j(M)$: details in paper. (Figure: example state transitions of element $e$ in substreams $S_{1,j}$ and $S_{2,j}$, with $\theta_1(e)=4$ and $\theta_2(e)=2$.)

Hardness Result
Maximum Charge Model Problem: given expression $E$, site $j$, element $e$ and constant $k$, does there exist a model $M$ satisfying $\Psi_j^+$ for which $\phi_j(M) \ge k$? The problem is NP-complete, by reduction from 3-SAT.

Charge Computation Heuristic
Works on the expression tree. Tracks culprit streams at each node of the expression tree, computed bottom-up; the culprit at the root determines the charge. See paper for details. (Figure: expression tree over leaves $S_1$, $S_2$, $S_3$.)

Analysis of Heuristic
Computational complexity: $O(s)$. Correctness lemma: if $E$ is a set expression in which each stream appears at most once, the tree-based heuristic computes charge values identical to those of the model-based approach.

Experimental Setup
Comparison of the tree-based and naïve approaches. $m=16$ sites; $\epsilon_j=\epsilon/m$. Synthetic dataset: $10^6$ stream updates, with the updated element chosen from a Zipfian distribution and the site chosen uniformly at random. Performance metric: number of messages.

Single Stream Cardinality Estimation

Set Expression Cardinality Estimation E 1 =(S 1 - S 2 )  S 3 E 2 =(S 1  S 2 )  S 3

Real Life Dataset
LBL-TCP-3 dataset (/LBL-TCP-3.html). Used 500,000 records from the dataset. Record fields: timestamp, src. IP, dest. IP, next hop IP. Sliding window of 2 seconds, $m=16$ sites.

Related Work
Most work on streams focuses on memory-efficient algorithms for a single stream: quantiles [GK01, GKMS02, CM04], set expression cardinality [GGR03], distinct values [Gib01], frequent elements [CCF02], etc. Most similar to our work is that of Olston et al. [OJW03, BO03]. [OJW03]: aggregation queries tracking sums. [BO03]: tracking top-k items at the coordinator. Our naïve algorithm adapts the scheme of [OJW03].

Concluding Remarks
A distributed framework for set expression cardinality estimation that minimizes communication while providing guarantees. It exploits global knowledge and set expression semantics. Experimental results: a factor of 2 to 20 improvement over the naïve approach, with higher savings for skewed data.

Thank You! Questions ?

Charge Triple Computation: Example
$E = S_1 \cap (S_2 - S_3)$, $e \in F_3$, $\theta_3(e)=4$. Stream charges: $\phi(S_1)=\phi(S_2)=1$, $\phi(S_3)=1/4$. Result: $\phi_j^+(e)=\phi(S_3)=1/4$ and $\phi_j^-(e)=0$. (Figure: table of the local states $S'_{i,j}$ and $S_{i,j}$ for $i=1,2,3$, and the expression tree over $S_1$, $S_2$, $S_3$ with each node annotated by its charge triples.)

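Model Based Scheme: Example
$E = S_1 \cap (S_2 - S_3)$. States at site $j$: $e \in F_3$, $\theta_3(e)=4$, so $\phi(S_1)=\phi(S_2)=1$ and $\phi(S_3)=1/4$.
$\Psi_j^+ = (\lnot p'_1 \lor \lnot p'_2 \lor p'_3) \land (p_1 \land p_2 \land \lnot p_3) \land (p_1 \land p'_2 \land p_2 \land p'_3)$.
$\{p'_3, \lnot p_3\} \subseteq M$ for any model $M$, and $S_3$ has a local state change at site $j$, so $\phi_j(M)=\phi(S_3)=1/4$ and therefore $\phi_j^+(e)=1/4$. $\Psi_j^-$ is unsatisfiable, so $\phi_j^-(e)=0$.
(Figure: table of local states for $i=1,2,3$; consistent with the formula above, $e$ appears in $S'_{i,j}$ for $i=2,3$ and in $S_{i,j}$ for $i=1,2$.)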
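The example can be reproduced by brute-force enumeration of models, which is feasible here because there are only six variables. This is a sketch under stated assumptions, not the procedure from the paper: the candidate culprits are taken to be the streams whose global state differs between $S_i$ and $S'_i$ in the model, the culprit is the candidate with the smallest stream charge (ties broken by index), and the set `LOCAL_CHANGE` of streams with a local state change at site $j$ is inferred from the local state formula. The variable names `q1..q3` stand for $p'_1..p'_3$.

```python
from fractions import Fraction
from itertools import product

VARS = ["p1", "p2", "p3", "q1", "q2", "q3"]  # q_i stands for p'_i

def in_E(s1, s2, s3):
    """Membership of e in E = S1 ∩ (S2 − S3), given its membership in each stream."""
    return s1 and s2 and not s3

def g_j(m):
    """Local state formula from the example: p1 ∧ p'2 ∧ p2 ∧ p'3."""
    return m["p1"] and m["q2"] and m["p2"] and m["q3"]

def psi_plus(m):
    """F_E ∧ ¬F'_E ∧ G_j: e is in E but not in E'."""
    return (in_E(m["p1"], m["p2"], m["p3"])
            and not in_E(m["q1"], m["q2"], m["q3"]) and g_j(m))

def psi_minus(m):
    """¬F_E ∧ F'_E ∧ G_j: e is in E' but not in E."""
    return (not in_E(m["p1"], m["p2"], m["p3"])
            and in_E(m["q1"], m["q2"], m["q3"]) and g_j(m))

def models(formula):
    """All truth assignments over VARS that satisfy the formula."""
    out = []
    for values in product([False, True], repeat=len(VARS)):
        m = dict(zip(VARS, values))
        if formula(m):
            out.append(m)
    return out

STREAM_CHARGE = {1: Fraction(1), 2: Fraction(1), 3: Fraction(1, 4)}
LOCAL_CHANGE = {1, 3}  # assumed: streams whose local state at site j changed

def model_charge(m):
    """Charge of one model: the culprit's charge if it changed locally, else 0."""
    changed = [i for i in (1, 2, 3) if m[f"p{i}"] != m[f"q{i}"]]
    if not changed:
        return Fraction(0)
    culprit = min(changed, key=lambda i: (STREAM_CHARGE[i], i))
    return STREAM_CHARGE[culprit] if culprit in LOCAL_CHANGE else Fraction(0)

def element_charge(formula):
    """Maximum model charge, or 0 if the formula is unsatisfiable."""
    return max((model_charge(m) for m in models(formula)), default=Fraction(0))

print(element_charge(psi_plus), element_charge(psi_minus))  # prints: 1/4 0
```

Exhaustive enumeration is exponential in the number of variables, which fits the hardness result stated earlier in the talk and motivates the tree-based heuristic.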

Charge Computation Heuristic
Tracks culprit streams at each node of the expression tree using "charge triples". The charge triple for model $M$ at a node $V$ is $t(M,V) = (a,b,x)$, where $a=1$ if $M$ satisfies $F'_{E(V)}$ and $a=0$ otherwise; $b=1$ if $M$ satisfies $F_{E(V)}$ and $b=0$ otherwise; and $x$ is the index of the culprit stream for $M$ in $V$'s subtree ($x$ is a special null value if no stream in $V$'s subtree has a global state change). The heuristic computes triples in bottom-up fashion.

Correctness
A charging scheme is correct iff it satisfies the following two correctness invariants: $\forall e \in E-E'$, $\sum_j \phi_j^+(e) \ge 1$; and $\forall e \in E'-E$, $\sum_j \phi_j^-(e) \ge 1$.
Charging scheme for the single stream case. Non-frequent elements: charge = 1 for each insertion/deletion. Frequent elements: $\phi_j^+(e)=0$ if $e$ is newly inserted; $\phi_j^-(e)=1/\theta_i(e)$ if $e$ is recently deleted.
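A short derivation suggests why the two invariants yield the error guarantee. It is a sketch under two assumptions: all charges are nonnegative, and each site keeps its total charges $\phi_j^+=\sum_e \phi_j^+(e)$ and $\phi_j^-=\sum_e \phi_j^-(e)$ at or below its budget $\epsilon_j$ between messages.

```latex
\begin{align*}
|E| - |E'| &\le |E - E'|
  \le \sum_{e \in E - E'} \sum_j \phi_j^+(e)
  \le \sum_j \phi_j^+
  \le \sum_j \epsilon_j = \epsilon, \\
|E'| - |E| &\le |E' - E|
  \le \sum_{e \in E' - E} \sum_j \phi_j^-(e)
  \le \sum_j \phi_j^-
  \le \sum_j \epsilon_j = \epsilon .
\end{align*}
```

The second inequality on each line is the corresponding invariant. Together the two lines give $|E|-\epsilon \le |E'| \le |E|+\epsilon$.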

Computing charge $\phi_j(M)$ for model $M$
Suppose $E=S_1 \cup S_2$, $e \in S'_{1,j}$, and $e \in F_1, F_2$.
$\Psi_j^- = (p'_1 \lor p'_2) \land (\lnot p_1 \land \lnot p_2) \land (p'_1 \land p'_2) = (p'_1 \land \lnot p_1) \land (p'_2 \land \lnot p_2)$.
$M$: $e$ must get deleted from $S_1$ and $S_2$ globally. Uniform culprit selection property: every site selects the same culprit stream $S_i \in P$. With $\theta_1(e)=4$ and $\theta_2(e)=2$, $\phi(S_1)=1/4$ and $\phi(S_2)=1/2$, so the culprit is $S_1$. $\phi_j(M) = 1/4$ since $S_1$ has a local state change at site $j$ ($\phi_j(M) = 0$ otherwise).
(Figure: state transitions of $e$ at site $j$: $1 \to 0$ in $S_{1,j}$ and $0 \to 0$ in $S_{2,j}$.)

Charging the Culprit Stream
Charge $\phi(S_i)$ for culprit stream $S_i$: $\phi(S_i) = 1/\theta_i(e)$ if $e \in F_i$, and $\phi(S_i) = 1$ otherwise. The charge $\phi_j(M)$ for model $M$ is defined in terms of the culprit stream charge: $\phi_j(M) = \phi(S_i)$ if $S_i$ has a local state change at site $j$, and $\phi_j(M) = 0$ otherwise. Lemma: the model-based charging scheme is correct.

Culprit Stream Selection
Select the culprit stream to minimize the charge $\phi_j^+(e)$ at site $j$: choose the stream in $P$ with the smallest charge as the culprit, breaking ties in favor of the stream with the smaller index. This satisfies the uniform culprit selection property.
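The selection rule is deterministic, so every site that sees the same candidates and the same global bounds picks the same stream. A minimal sketch, in which the function names are hypothetical and `theta` maps the index of each stream where the element is frequent to its lower bound:

```python
from fractions import Fraction

def stream_charge(i, theta):
    """Stream charge: 1/theta_i(e) if e is frequent in stream i, else 1."""
    return Fraction(1, theta[i]) if i in theta else Fraction(1)

def select_culprit(candidates, theta):
    """Pick the candidate with the smallest charge; ties go to the smaller index."""
    return min(candidates, key=lambda i: (stream_charge(i, theta), i))

# Example from the talk: theta_1(e) = 4 and theta_2(e) = 2 give charges 1/4 and 1/2.
print(select_culprit({1, 2}, {1: 4, 2: 2}))
```

Because the key depends only on globally shared information, no coordination between sites is needed to agree on the culprit.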
