Approximate Counts and Quantiles over Sliding Windows. Arvind Arasu, Gurmeet Singh Manku, Stanford University. PODS, June 16, 2004.

Presentation transcript:

Sliding Window Model
[Figure: four-frame animation of a window sliding along a stream over a time axis; the window's SUM is 66 in the third frame and 59 in the fourth.]

Statistics over Sliding Windows
- Easy if we store the entire window
- Storing the entire window is expensive
  - Space: “last 1 hour” at 1000 elements/sec is 3.6 million elements
- Focus of much previous work: compute approximate statistics using limited space

Contributions
- Algorithms for computing approximate quantiles and approximate frequency counts over sliding windows
- Space requirement: $\frac{1}{\epsilon}\,\mathrm{polylog}\!\left(\frac{1}{\epsilon}, N\right)$
  - ε = error parameter
  - N = size of the window
  - Logarithmic in window size (N)
  - (Almost) linear in $\frac{1}{\epsilon}$

Contributions over Previous Work
- Frequency counts: first known algorithm for the sliding window model
- Quantiles: improves over [LLXY ’04]
  - [LLXY ’04] space: $\frac{1}{\epsilon^2}\,\mathrm{polylog}\!\left(\frac{1}{\epsilon}, N\right)$
  - Quadratic in $\frac{1}{\epsilon}$

Rest of the Talk
- Formal problem specification
  - Sliding windows
  - (Approximate) frequency counts
- Our algorithms
  - Fixed-size sliding windows
  - Variable-size sliding windows
This talk covers frequency counts only; for quantiles, see the paper.

Sliding Windows
Two abstract window models:
- Fixed-size sliding windows
  - Row-based windows
- Variable-size sliding windows
  - Time-based windows, shared windows

Fixed-Size Sliding Windows
[Figure: four-frame animation over a time axis; the window size (N) is 5 in every frame.]

Variable-Size Sliding Windows
[Figure: seven-frame animation over a time axis; the window size (N) changes from frame to frame: 5, 6, 7, 6, 5, 4, 3.]
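The two window models differ only in whether arrival and expiry are coupled. A minimal Python sketch, assuming a `deque` that stores the whole window (exactly the expensive baseline the talk sets out to avoid); the class and method names are illustrative and do not come from the talk:

```python
from collections import deque


class FixedSizeWindow:
    """Fixed-size model: once full, every arrival evicts the oldest element."""

    def __init__(self, size):
        self.items = deque(maxlen=size)  # deque discards the oldest item itself

    def insert(self, x):
        self.items.append(x)


class VariableSizeWindow:
    """Variable-size model: arrivals and expiries happen independently."""

    def __init__(self):
        self.items = deque()

    def insert(self, x):
        self.items.append(x)   # newest end advances, window grows by one

    def expire(self):
        self.items.popleft()   # oldest end advances, window shrinks by one


fixed = FixedSizeWindow(5)
for x in range(8):
    fixed.insert(x)
print(list(fixed.items))       # the five most recent elements

var = VariableSizeWindow()
for x in range(5):             # start with a window of size 5
    var.insert(x)
sizes = []
for op in ["insert", "insert", "expire", "expire", "expire", "expire"]:
    if op == "insert":
        var.insert(0)
    else:
        var.expire()
    sizes.append(len(var.items))
print(sizes)                   # window size after each operation
```

The operation sequence in the usage example was chosen to retrace the window sizes shown in the variable-size animation.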

Frequency Counts
[Table: Element | Count; rows are not preserved in the transcript.]
Query: SELECT Element, COUNT(*) FROM Multiset GROUP BY Element

Approximate Frequency Counts
- Output: elements and their approximate counts
- Approximate count: True Count - εM < Approximate Count ≤ True Count
  - Error parameter: ε
  - Size of input: M
- Only elements with Approximate Count > 0 are reported
- References: [MG ’82, DLM ’02, MM ’02, KSP ’03]

Approximate Frequency Counts: Example
Input size: M = 20. Error parameter: ε = 0.25. Absolute error: εM = 5.
[Table: Element | True Count | Approx. Count | Error; row values are not preserved in the transcript.]

Approximate Frequency Counts
- All elements with frequency ≥ εM appear in the output.
- There exists an output with ≤ $\frac{1}{\epsilon}$ elements.
- Theorem: An approximate frequency count of size $O\!\left(\frac{1}{\epsilon}\right)$ can be produced in one pass over the input using $O\!\left(\frac{1}{\epsilon}\right)$ space.
- References: [MG ’82, DLM ’02, KSP ’03]
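One well-known way to meet the one-pass bound is the counter-based scheme usually attributed to [MG ’82]. The talk does not say which of the cited algorithms it has in mind, so the following is a sketch under that assumption; `approx_counts` is a hypothetical name and the input string is made up (only M = 20 and ε = 0.25 echo the example slide).

```python
import math
from collections import Counter


def approx_counts(stream, eps):
    """One-pass Misra-Gries summary using at most ceil(1/eps) counters."""
    k = math.ceil(1 / eps)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            # The arriving element itself is discarded as well.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = list("aaaaaaaabbbbbbcccdef")   # made-up input with M = 20
eps = 0.25
M = len(stream)
true = Counter(stream)
approx = approx_counts(stream, eps)

for x in sorted(true):
    a = approx.get(x, 0)
    # Guarantee from the definition slide: true - eps*M < approx <= true.
    assert true[x] - eps * M < a <= true[x]
    print(x, true[x], a)
```

Each decrement step discards k + 1 occurrences at once (k counters plus the arriving element), so there are at most M / (k + 1) such steps. With k ≥ 1/ε that undercount is strictly below εM.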

Fixed-Size Sliding Windows
- Window size: N
- Error parameter: ε
- Absolute error: εN

Overview
[Figure: nine-frame animation over a window of size N; no other labels survive in the transcript.]

Details
[Figure: block structure over the window of size N. Recoverable labels: εN/4, log(1/ε), ε_0, ε_1, ε_2, and an annotation "= O(εN)".]

Error Invariant
Absolute error of all blocks is identical:
$\epsilon_i N_i = \frac{\epsilon N}{\log(1/\epsilon)}$
where $\epsilon_i$ is the error parameter for block $i$ and $N_i$ is the number of elements in block $i$.

Merge Operation
Add approximate counts of elements. Absolute error adds up. Here $f$ is the true count and $\tilde{f}$ the approximate count of an element.
- Block 1: $f_1 - \epsilon_1 N_1 < \tilde{f}_1 \le f_1$
- Block 2: $f_2 - \epsilon_2 N_2 < \tilde{f}_2 \le f_2$
- Block 1 + Block 2: $(f_1 + f_2) - (\epsilon_1 N_1 + \epsilon_2 N_2) < \tilde{f}_1 + \tilde{f}_2 \le f_1 + f_2$
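The merge rule amounts to adding per-element counts. A small Python sketch, assuming each block summary is a mapping from element to approximate count; the function name and all numbers are made up for illustration:

```python
from collections import Counter


def merge(approx1, approx2):
    """Merge two block summaries by adding approximate counts per element."""
    return Counter(approx1) + Counter(approx2)


# Hypothetical blocks: element "a" has true counts 10 and 12.
# Block 1: N1 = 20, eps1 = 0.25  -> absolute error 5, approx count 7
# Block 2: N2 = 40, eps2 = 0.125 -> absolute error 5, approx count 9
block1 = {"a": 7, "b": 2}
block2 = {"a": 9}
merged = merge(block1, block2)

true_a = 10 + 12
error = 0.25 * 20 + 0.125 * 40   # absolute errors add up
assert true_a - error < merged["a"] <= true_a
print(dict(merged), error)
```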

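Error Analysis
[Figure: error breakdown over the window of size N. The formula did not survive extraction intact; surviving fragments are O(εN), log(1/ε), εN, and an O(·) term involving 1/ε, joined by "+" signs.]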

Space Requirement
[Figure: block structure over the window of size N. Recoverable labels: εN/4, log(1/ε), ε_0, ε_1, ε_2.]

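Space Requirement
Space required for level-$\ell$ blocks (size of approx. count × number of “active” blocks):
$\frac{1}{\epsilon_\ell} \times \frac{N}{N_\ell} = \frac{N}{\epsilon N / \log(1/\epsilon)} = \frac{1}{\epsilon}\log\!\left(\frac{1}{\epsilon}\right)$
Total space:
$\frac{1}{\epsilon}\log\!\left(\frac{1}{\epsilon}\right) \times \log\!\left(\frac{1}{\epsilon}\right) = \frac{1}{\epsilon}\log^2\!\left(\frac{1}{\epsilon}\right)$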
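The space argument can be checked numerically. The sketch below rests on several assumptions not spelled out in the transcript: 1/ε is a power of two (at least 2), logarithms are base 2, level-0 blocks hold εN/4 elements, and block size doubles at each level. Only the invariant ε_ℓ N_ℓ = εN / log(1/ε) and the per-level space formula are taken from the slides; `block_layout` is a hypothetical helper.

```python
from fractions import Fraction


def block_layout(inv_eps, N):
    """Return (block size, error parameter, space) for each level.

    inv_eps is 1/eps, assumed to be a power of two with inv_eps >= 2.
    """
    levels = inv_eps.bit_length() - 1                 # log2(1/eps)
    per_block_error = Fraction(N, inv_eps * levels)   # eps*N / log(1/eps)
    layout = []
    for level in range(levels):
        # Assumed sizes: eps*N/4 at level 0, doubling per level.
        size = Fraction(N, 4 * inv_eps) * 2 ** level
        eps_l = per_block_error / size                # keeps eps_l * N_l constant
        space = (1 / eps_l) * (N / size)              # summary size x active blocks
        layout.append((size, eps_l, space))
    return layout


layout = block_layout(16, 1024)                       # eps = 1/16, N = 1024
for size, eps_l, space in layout:
    print(size, eps_l, space)
total = sum(space for _, _, space in layout)
print(total)                                          # (1/eps) * log2(1/eps)**2
```

Every level costs the same (1/ε) log(1/ε), which is why the total is just that figure multiplied by the number of levels.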

Fixed-Size Sliding Windows: Summary
Theorem: ε-approximate frequency counts can be maintained over a fixed-size sliding window of size N using $O\!\left(\frac{1}{\epsilon}\log^2\!\left(\frac{1}{\epsilon}\right)\right)$ space.

Variable-Size Windows
- Error parameter: ε
- Variable window size: n
- Variable absolute error: εn

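Fixed-Size Window Algorithm?
[Figure: the fixed-size block structure over a window of size N, with labels εN/4, log(1/ε), ε_0, ε_1, ε_2.]
F(ε, N), the fixed-size algorithm run with error parameter ε and window size N, has absolute error εN. Over a window of only n elements, its effective error parameter is εN/n.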

Limited Variability
F(ε/2, N) computes ε-approximate frequency counts for window sizes N/2 ≤ n ≤ N.
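The claim follows from comparing absolute errors. A short derivation, assuming that F(ε', N) guarantees absolute error at most ε'N on any window of at most N elements:

```latex
\[
\text{absolute error of } F(\epsilon/2,\, N) \;\le\; \frac{\epsilon}{2}\, N,
\qquad
\frac{N}{2} \le n \le N
\;\Longrightarrow\;
\frac{\epsilon}{2}\, N \;\le\; \epsilon\, n .
\]
```

So for every n in that range the error stays within εn, which is what an ε-approximate count over a window of n elements requires.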

Variable-Size Windows
- Instances of the fixed-size algorithm: F(ε/2, N), F(ε/2, N/2), ..., F(ε/2, 2/ε)
- Number of instances: log(εn)
- $N = 2^p \ge n > N/2$
[Figure: eight-frame animation over a time axis with current window size n. F(ε/2, N) is absent in three frames and present again in the final frame.]
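The set of instances for a given window size can be enumerated directly. A sketch, assuming 1/ε is a power of two so that every instance size is a power of two; `instance_sizes` is a hypothetical helper, not something named in the talk:

```python
def instance_sizes(inv_eps, n):
    """Window sizes of the fixed-size instances kept for current size n.

    inv_eps is 1/eps, assumed to be a power of two. The largest size N is
    the smallest power of two with N >= n (and never below 2/eps).
    """
    smallest = 2 * inv_eps          # the 2/eps instance
    largest = smallest
    while largest < n:
        largest *= 2                # ends with largest >= n > largest / 2
    sizes = []
    size = largest
    while size >= smallest:
        sizes.append(size)
        size //= 2
    return sizes


print(instance_sizes(4, 100))       # eps = 1/4, current window size n = 100
print(instance_sizes(4, 60))        # a smaller window needs fewer instances
```

When n drops to N/2 or below, the largest instance is no longer needed, and when n grows past N a new largest instance is required, which matches the instance disappearing and reappearing in the animation.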

Variable-Size Windows: Summary
Theorem: ε-approximate frequency counts can be maintained over variable-size windows using $O\!\left(\frac{1}{\epsilon}\log^2\!\left(\frac{1}{\epsilon}\right)\log(\epsilon n)\right)$ space, where n is the current size of the sliding window.

See Paper for …
- Randomized algorithms for frequency counts
- Deterministic and randomized algorithms for quantiles
- A general technique for variable-size window algorithms
  - Converts fixed-size window algorithms to variable-size window algorithms
  - Works for Sum, Bit-Count

References used in Talk
[DLM ’02]: E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. ESA
[KSP ’03]: R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. TODS
[LLXY ’04]: X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent N elements over a data stream. ICDE
[MG ’82]: J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Programming
[MM ’02]: G. S. Manku and R. Motwani. Approximate frequency counts over data streams. VLDB 2002.