Data Stream Processing (Part I) Alon,, Matias, Szegedy. “The space complexity of approximating the frequency moments”, ACM STOC’1996. Alon, Gibbons, Matias,

Slides:



Advertisements
Similar presentations
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
Advertisements

The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
Estimating Join-Distinct Aggregates over Update Streams Minos Garofalakis Bell Labs, Lucent Technologies (Joint work with Sumit Ganguly, Amit Kumar, Rajeev.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
Streaming Algorithms for Robust, Real- Time Detection of DDoS Attacks S. Ganguly, M. Garofalakis, R. Rastogi, K. Sabnani Krishan Sabnani Bell Labs Research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Finding Aggregates from Streaming Data in Single Pass Medha Atre Course Project for CS631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
Data Stream Mining and Querying
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Approximation Techniques for Data Management Systems “We are drowning in data but starved for knowledge” John Naisbitt CS 186 Fall 2005.
Statistic estimation over data stream Slides modified from Minos Garofalakis ( yahoo! research) and S. Muthukrishnan (Rutgers University)
Data Stream Processing (Part IV)
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Skip Lists1 Skip Lists William Pugh: ” Skip Lists: A Probabilistic Alternative to Balanced Trees ”, 1990  S0S0 S1S1 S2S2 S3S3 
1 Administrivia  List of potential projects will be out by the end of the week  If you have specific project ideas, catch me during office hours (right.
Data Stream Processing (Part III) Gibbons. “Distinct sampling for highly accurate answers to distinct values queries and event reports”, VLDB’2001. Ganguly,
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Graham Cormode S. Muthukrishnan
Intel Research Sketching Streams through the Net: Distributed Approximate Query Tracking Graham Cormode Bell Laboratories Minos Garofalakis.
Data Stream Algorithms Intro, Sampling, Entropy
Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.
A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.
Jennifer Rexford Princeton University MW 11:00am-12:20pm Measurement COS 597E: Software Defined Networking.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Data Mining: Concepts and Techniques Mining data streams
Calculating frequency moments of Data Stream
Sketch-Based Multi-Query Processing over Data Streams Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.
Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Frequency Counts over Data Streams
The Variable-Increment Counting Bloom Filter
Finding Frequent Items in Data Streams
Estimating L2 Norm MIT Piotr Indyk.
Streaming & sampling.
Big Data Infrastructure
Counting How Many Elements Computing “Moments”
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
AQUA: Approximate Query Answering
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Range-Efficient Computation of F0 over Massive Data Streams
Introduction to Stream Computing and Reservoir Sampling
By: Ran Ben Basat, Technion, Israel
Presentation transcript:

Data Stream Processing (Part I) Alon,, Matias, Szegedy. “The space complexity of approximating the frequency moments”, ACM STOC’1996. Alon, Gibbons, Matias, Szegedy. “Tracking Join and Self-join Sizes in Limited Storage”, ACM PODS’1999. SURVEY-1: S. Muthukrishnan. “Data Streams: Algorithms and Applications” SURVEY-2: Babcock et al. “Models and Issues in Data Stream Systems”, ACM PODS’2002.

2 Data-Stream Management data sets  Traditional DBMS – data stored in finite, persistent data sets  Data Streams – distributed, continuous, unbounded, rapid, time varying, noisy,...  Data-Stream Management – variety of modern applications –Network monitoring and traffic engineering –Telecom call-detail records –Network security –Financial applications –Sensor networks –Manufacturing processes –Web logs and clickstreams –Massive data sets

3 Networks Generate Massive Data Streams Broadband Internet Access Converged IP/MPLS Network PSTN DSL/Cable Networks Enterprise Networks Voice over IP FR, ATM, IP VPN Network Operations Center (NOC) SNMP/RMON, NetFlow records BGP OSPF Peer  SNMP/RMON/NetFlow data records arrive 24x7 from different parts of the network  Truly massive streams arriving at rapid rates –AT&T collects GigaBytes of NetFlow data each day!  Typically shipped to a back-end data warehouse (off site) for off-line analysis Source Destination Duration Bytes Protocol K http K http K http K http K http K ftp K ftp K ftp Example NetFlow IP Session Data

4 Packet-Level Data Streams  Single 2Gb/sec link; say avg packet size is 50bytes  Number of packets/sec = 5 million  Time per packet = 0.2 microsec  If we only capture header information per packet: src/dest IP, time, no. of bytes, etc. – at least 10bytes. –Space per second is 50Mb –Space per day is 4.5Tb per link –ISPs typically have hundred of links!  Analyzing packet content streams – whole different ballgame!!

5 Real-Time Data-Stream Analysis  Need ability to process/analyze network-data streams in real-time –As records stream in: look at records only once in arrival order! –Within resource (CPU, memory) limitations of the NOC  Critical to important NM tasks –Detect and react to Fraud, Denial-of-Service attacks, SLA violations –Real-time traffic engineering to improve load-balancing and utilization DBMS (Oracle, DB2) Back-end Data Warehouse Off-line analysis – Data access is slow, expensive Converged IP/MPLS Network PSTN DSL/Cable Networks Enterprise Networks Network Operations Center (NOC) BGP Peer R1 R2 R3 What are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month? SELECT COUNT (R1.source, R1.dest) FROM R1, R2 WHERE R1.source = R2.source SQL Join Query How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3? Set-Expression Query

6 IP Network Data Processing  Traffic estimation –How many bytes were sent between a pair of IP addresses? –What fraction network IP addresses are active? –List the top 100 IP addresses in terms of traffic  Traffic analysis –What is the average duration of an IP session? –What is the median of the number of bytes in each IP session?  Fraud detection –List all sessions that transmitted more than 1000 bytes –Identify all sessions whose duration was more than twice the normal  Security/Denial of Service –List all IP addresses that have witnessed a sudden spike in traffic –Identify IP addresses involved in more than 1000 sessions

7 Overview  Introduction & Motivation  Data Streaming Models & Basic Mathematical Tools  Summarization/Sketching Tools for Streams –Sampling –Linear-Projection (aka AMS) Sketches Applications: Join/Multi-Join Queries, Wavelets –Hash (aka FM) Sketches Applications: Distinct Values, Set Expressions

8 The Streaming Model  Underlying signal: One-dimensional array A[1…N] with values A[i] all initially zero –Multi-dimensional arrays as well (e.g., row-major)  Signal is implicitly represented via a stream of updates –j-th update is implying A[k] := A[k] + c[j] (c[j] can be >0, <0)  Goal: Compute functions on A[] subject to –Small space –Fast processing of updates –Fast function computation –…

9 Example IP Network Signals  Number of bytes (packets) sent by a source IP address during the day –2^(32) sized one-d array; increment only  Number of flows between a source-IP, destination-IP address pair during the day –2^(64) sized two-d array; increment only, aggregate packets into flows  Number of active flows per source-IP address –2^(32) sized one-d array; increment and decrement

10 Streaming Model: Special Cases  Time-Series Model –Only j-th update updates A[j] (i.e., A[j] := c[j])  Cash-Register Model – c[j] is always >= 0 (i.e., increment-only) –Typically, c[j]=1, so we see a multi-set of items in one pass  Turnstile Model –Most general streaming model – c[j] can be >0 or <0 (i.e., increment or decrement)  Problem difficulty varies depending on the model –E.g., MIN/MAX in Time-Series vs. Turnstile!

11 Data - Stream Processing Model  Approximate answers often suffice, e.g., trend analysis, anomaly detection  Requirements for stream synopses –Single Pass: Each record is examined at most once, in (fixed) arrival order –Small Space: Log or polylog in data stream size –Real-time: Per-record processing time (to maintain synopses) must be low –Delete-Proof: Can handle record deletions as well as insertions –Composable: Built in a distributed fashion and combined later Stream Processing Engine Approximate Answer with Error Guarantees “Within 2% of exact answer with high probability” Stream Synopses (in memory) Continuous Data Streams Query Q R1 Rk (GigaBytes) (KiloBytes)

12 Data Stream Processing Algorithms  Generally, algorithms compute approximate answers –Provably difficult to compute answers accurately with limited memory  Approximate answers - Deterministic bounds –Algorithms only compute an approximate answer, but bounds on error  Approximate answers - Probabilistic bounds –Algorithms compute an approximate answer with high probability With probability at least, the computed answer is within a factor of the actual answer  Single-pass algorithms for processing streams also applicable to (massive) terabyte databases!

13 Sampling: Basics  Idea: A small random sample S of the data often well-represents all the data –For a fast approx answer, apply “modified” query to S –Example: select agg from R where R.e is odd (n=12) –If agg is avg, return average of odd elements in S –If agg is count, return average over all elements e in S of n if e is odd 0 if e is even Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer Data stream: Sample S: answer: 5 answer: 12*3/4 =9

14 Probabilistic Guarantees  Example: Actual answer is within 5 ± 1 with prob  0.9  Randomized algorithms: Answer returned is a specially- built random variable  Use Tail Inequalities to give probabilistic bounds on returned answer –Markov Inequality –Chebyshev’s Inequality –Chernoff Bound –Hoeffding Bound

15 Basic Tools: Tail Inequalities  General bounds on tail probability of a random variable (that is, probability that a random variable deviates far from its expectation)  Basic Inequalities: Let X be a random variable with expectation and variance Var[X]. Then for any Probability distribution Tail probability Markov: Chebyshev:

16 Tail Inequalities for Sums  Possible to derive stronger bounds on tail probabilities for the sum of independent random variables  Hoeffding’s Inequality: Let X1,..., Xm be independent random variables with 0<=Xi <= r. Let and be the expectation of. Then, for any,  Application to avg queries: –m is size of subset of sample S satisfying predicate (3 in example) –r is range of element values in sample (8 in example)  Application to count queries: –m is size of sample S (4 in example) –r is number of elements n in stream (12 in example)  More details in [HHW97]

17 Tail Inequalities for Sums  Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials  Chernoff Bound: Let X1,..., Xm be independent Bernoulli trials such that Pr[Xi=1] = p (Pr[Xi=0] = 1-p). Let and be the expectation of. Then, for any,  Application to count queries: –m is size of sample S (4 in example) –p is fraction of odd elements in stream (2/3 in example)  Remark: Chernoff bound results in tighter bounds for count queries compared to Hoeffding’s inequality

18 Overview  Introduction & Motivation  Data Streaming Models & Basic Mathematical Tools  Summarization/Sketching Tools for Streams –Sampling –Linear-Projection (aka AMS) Sketches Applications: Join/Multi-Join Queries, Wavelets –Hash (aka FM) Sketches Applications: Distinct Values, Set Expressions

19 Computing Stream Sample  Reservoir Sampling [Vit85]: Maintains a sample S of a fixed-size M –Add each new element to S with probability M/n, where n is the current number of stream elements –If add an element, evict a random element from S –Instead of flipping a coin for each element, determine the number of elements to skip before the next to be added to S  Concise sampling [GM98]: Duplicates in sample S stored as pairs (thus, potentially boosting actual sample size) –Add each new element to S with probability 1/T (simply increment count if element already in S) –If sample size exceeds M Select new threshold T’ > T Evict each element (decrement count) from S with probability 1-T/T’ –Add subsequent elements to S with probability 1/T’

20 Synopses for Relational Streams  Conventional data summaries fall short –Quantiles and 1-d histograms [MRL98,99], [GK01], [GKMS02] Cannot capture attribute correlations Little support for approximation guarantees –Samples (e.g., using Reservoir Sampling) Perform poorly for joins [AGMS99] or distinct values [CCMN00] Cannot handle deletion of records –Multi-d histograms/wavelets Construction requires multiple passes over the data  Different approach: Pseudo-random sketch synopses –Only logarithmic space –Probabilistic guarantees on the quality of the approximate answer –Support insertion as well as deletion of records (i.e., Turnstile model)

21 Linear-Projection (aka AMS) Sketch Synopses  Goal:  Goal: Build small-space summary for distribution vector f(i) (i=1,..., N) seen as a stream of i-values  Basic Construct:  Basic Construct: Randomized Linear Projection of f() = project onto inner/dot product of f-vector –Simple to compute over the stream: Add whenever the i-th value is seen –Generate ‘s in small (logN) space using pseudo-random generators –Tunable probabilistic guarantees on approximation error –Delete-Proof: Just subtract to delete an i-th value occurrence –Composable: Simply add independently-built projections Data stream: 3, 1, 2, 4, 2, 3, 5,... f(1) f(2) f(3) f(4) f(5) where = vector of random values from an appropriate distribution

22 Example: Binary-Join COUNT Query  Problem: Compute answer for the query COUNT(R A S)  Example:  Exact solution: too expensive, requires O(N) space! –N = sizeof(domain(A)) Data stream R.A: Data stream S.A: = 10 ( )

23 Basic AMS Sketching Technique [AMS96]  Key Intuition: Use randomized linear projections of f() to define random variable X such that –X is easily computed over the stream (in small space) –E[X] = COUNT(R A S) –Var[X] is small  Basic Idea: –Define a family of 4-wise independent {-1, +1} random variables –Pr[ = +1] = Pr[ = -1] = 1/2 Expected value of each, E[ ] = 0 –Variables are 4-wise independent Expected value of product of 4 distinct = 0 –Variables can be generated using pseudo-random generator using only O(log N) space (for seeding)! Probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)

24 AMS Sketch Construction  Compute random variables: and –Simply add to X R (X S ) whenever the i-th value is observed in the R.A (S.A) stream  Define X = X R X S to be estimate of COUNT query  Example: Data stream R.A: Data stream S.A:

25 Binary-Join AMS Sketching Analysis  Expected value of X = COUNT(R A S)  Using 4-wise independence, possible to show that  is self-join size of R 0 1