An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.



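Data Stream Model
- We consider the vector $a(t) = [a_1(t), \dots, a_n(t)]$, initially $a_i(0) = 0$ for all $i$.
- The $t$-th update is $(i_t, c_t)$: $a_{i_t}(t) = a_{i_t}(t-1) + c_t$, and $a_{i'}(t) = a_{i'}(t-1)$ for $i' \ne i_t$.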

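Count-Min Sketch
- A Count-Min (CM) sketch with parameters $(\varepsilon, \delta)$ is represented by a two-dimensional array of counts with width $w$ and depth $d$: $count[1,1], \dots, count[d,w]$.
- Given parameters $(\varepsilon, \delta)$, set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$. Each entry of the array is initially zero.
- $d$ hash functions $h_1, \dots, h_d : \{1, \dots, n\} \to \{1, \dots, w\}$ are chosen uniformly at random from a pairwise independent family.

- Update procedure: when $(i_t, c_t)$ arrives, set $count[j, h_j(i_t)] \leftarrow count[j, h_j(i_t)] + c_t$ for every row $1 \le j \le d$.

Approximate Query Answering Using CM Sketches
- point query $Q(i)$: approximates $a_i$
- range query $Q(l, r)$: approximates $\sum_{i=l}^{r} a_i$
- inner product query $Q(a, b)$: approximates $a \odot b = \sum_{i=1}^{n} a_i b_i$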

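Point Query
- Non-negative case ($a_i(t) \ge 0$ for all $i$ and $t$): the estimate is $\hat{a}_i = \min_j count[j, h_j(i)]$.
Theorem 1: $a_i \le \hat{a}_i$, and with probability at least $1 - \delta$, $\hat{a}_i \le a_i + \varepsilon \|a\|_1$.

PROOF: We introduce indicator variables $I_{i,j,k}$, equal to 1 if $(i \ne k) \wedge (h_j(i) = h_j(k))$ and 0 otherwise. By pairwise independence of the hash functions, $E(I_{i,j,k}) = \Pr[h_j(i) = h_j(k)] \le 1/w \le \varepsilon/e$. Define the variable $X_{i,j} = \sum_{k=1}^{n} I_{i,j,k}\, a_k$. By construction, $count[j, h_j(i)] = a_i + X_{i,j}$, and since $X_{i,j} \ge 0$ this gives $\hat{a}_i \ge a_i$.

For the other direction, observe that $E(X_{i,j}) = \sum_{k} a_k E(I_{i,j,k}) \le (\varepsilon/e)\|a\|_1$. By the Markov inequality and the independence of the $d$ rows, $\Pr[\hat{a}_i > a_i + \varepsilon\|a\|_1] = \Pr[\forall j : X_{i,j} > \varepsilon\|a\|_1] \le \Pr[\forall j : X_{i,j} > e\,E(X_{i,j})] \le e^{-d} \le \delta$. ■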
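The structure, the update procedure and the non-negative point query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the method names, the `seed` parameter and the hash family `((a*x + b) mod p) mod w` (a Carter-Wegman style family with a Mersenne prime `p`) are choices made here, not taken from the slides, and item identifiers are assumed to be non-negative integers smaller than `p`.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch: w = ceil(e/eps), d = ceil(ln(1/delta))."""

    _PRIME = (1 << 61) - 1  # assumed to exceed every item identifier

    def __init__(self, eps, delta, seed=None):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1.0 / delta))
        self.counts = [[0] * self.w for _ in range(self.d)]
        # One (a, b) pair per row defines h_j(x) = ((a*x + b) mod p) mod w.
        self.hashes = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(self.d)
        ]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self._PRIME) % self.w

    def update(self, i, c=1):
        """Process the update (i, c): add c to one counter in every row."""
        for j in range(self.d):
            self.counts[j][self._bucket(j, i)] += c

    def point_query(self, i):
        """Minimum over rows; never below a_i when all counts are non-negative."""
        return min(self.counts[j][self._bucket(j, i)] for j in range(self.d))
```

For example, `CountMinSketch(eps=0.01, delta=0.01)` allocates 5 rows of 272 counters, independent of the number of distinct items in the stream.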

Time to produce the estimate: $O(\ln(1/\delta))$
Time for updates: $O(\ln(1/\delta))$
Space used: $O((1/\varepsilon)\ln(1/\delta))$
Remark: The constant $e$ is used here to minimize the space used.
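The remark about the constant can be checked with a short calculation. The following is a sketch under one assumption not stated on the slide: the width is generalized to $b/\varepsilon$ for a free base $b > 1$, so that the Markov step bounds each row's failure probability by $1/b$ and the depth becomes $\log_b(1/\delta)$.

```latex
\begin{align*}
  \text{space}(b) &= \frac{b}{\varepsilon}\,\log_b\frac{1}{\delta}
                   = \frac{1}{\varepsilon}\,\ln\frac{1}{\delta}\cdot\frac{b}{\ln b}, \\
  \frac{d}{db}\,\frac{b}{\ln b} &= \frac{\ln b - 1}{\ln^2 b} = 0
  \quad\Longleftrightarrow\quad b = e .
\end{align*}
```

Since $b/\ln b$ decreases for $b < e$ and increases for $b > e$, the choice $b = e$ gives the smallest number of counters for fixed $\varepsilon$ and $\delta$.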

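- General case ($a_i$ may be negative): the estimate is $\hat{a}_i = \mathrm{median}_j\, count[j, h_j(i)]$.
Theorem 2: With probability $1 - \delta^{1/4}$, $a_i - 3\varepsilon\|a\|_1 \le \hat{a}_i \le a_i + 3\varepsilon\|a\|_1$.
PROOF: Chernoff bounds ■

Time to produce the estimate: $O(\ln(1/\delta))$
Time for updates: $O(\ln(1/\delta))$
Space used: $O((1/\varepsilon)\ln(1/\delta))$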
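When updates can drive entries negative, the minimum is no longer a safe estimate and the median over rows is used instead. A minimal Python sketch of that variant follows; the dictionary layout, the function names and the hash family `((a*x + b) mod p) mod w` are assumptions made for illustration only.

```python
import random
import statistics

P = (1 << 61) - 1  # prime modulus for the illustrative hash family


def make_sketch(w, d, seed=0):
    """Create an empty d x w sketch with one (a, b) hash pair per row."""
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return {"w": w, "counts": [[0] * w for _ in range(d)], "hashes": hashes}


def bucket(sketch, j, i):
    a, b = sketch["hashes"][j]
    return ((a * i + b) % P) % sketch["w"]


def update(sketch, i, c):
    """Apply the update (i, c); c may be negative in the general case."""
    for j in range(len(sketch["counts"])):
        sketch["counts"][j][bucket(sketch, j, i)] += c


def median_query(sketch, i):
    """General-case point query: median over rows of the counter hit by i."""
    rows = sketch["counts"]
    return statistics.median(
        rows[j][bucket(sketch, j, i)] for j in range(len(rows))
    )
```

With an odd depth the median is one of the stored counters; with an even depth `statistics.median` averages the two middle values.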

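Inner Product Query
Set $(\widehat{a \odot b})_j = \sum_{k=1}^{w} count_a[j,k] \cdot count_b[j,k]$ and $\widehat{a \odot b} = \min_j (\widehat{a \odot b})_j$, where the sketches of $a$ and $b$ use the same hash functions.

Theorem 3: $a \odot b \le \widehat{a \odot b}$, and with probability $1 - \delta$, $\widehat{a \odot b} \le a \odot b + \varepsilon\|a\|_1\|b\|_1$.
PROOF: Markov inequality ■

Time to produce the estimate: $O((1/\varepsilon)\log(1/\delta))$
Time for updates: $O(\log(1/\delta))$
Space used: $O((1/\varepsilon)\log(1/\delta))$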
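The inner product estimate can be illustrated with a short Python sketch. It assumes both vectors are non-negative and are sketched with the same hash functions; the function names and the hash family `((a*x + b) mod p) mod w` are hypothetical choices for this example.

```python
import random

P = (1 << 61) - 1  # prime modulus for the illustrative hash family


def make_hashes(d, seed=0):
    """Draw d (a, b) pairs; the same pairs must be used for both vectors."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]


def build_sketch(vector, hashes, w):
    """Sketch a dict {item: non-negative count} into a d x w array."""
    counts = [[0] * w for _ in hashes]
    for i, c in vector.items():
        for j, (a, b) in enumerate(hashes):
            counts[j][((a * i + b) % P) % w] += c
    return counts


def inner_product_estimate(counts_a, counts_b):
    """Minimum over rows j of sum_k count_a[j][k] * count_b[j][k]."""
    return min(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(counts_a, counts_b)
    )
```

Each row's sum contains every true term $a_i b_i$ plus non-negative cross terms from colliding items, which is why the estimate never falls below $a \odot b$ in the non-negative case.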

The application of inner-product computation to join size estimation (where the vectors generated have non-negative entries)
- The join size of two database relations on a particular attribute is the number of items in the Cartesian product of the two relations which agree on the value of that attribute.
- $a_i$, $b_i$: the number of tuples in each relation which have value $i$ in the join attribute, so the join size is $a \odot b$.

Corollary 1: The join size of two relations on a particular attribute can be approximated up to $\varepsilon\|a\|_1\|b\|_1$ with probability $1 - \delta$ by keeping space $O((1/\varepsilon)\log(1/\delta))$.

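Range Query
- Dyadic range: $[x 2^y + 1, (x+1) 2^y]$ for parameters $x$, $y$.
- A range query becomes at most $2\log_2 n$ dyadic range queries, each of which is a single point query.
- For each set of dyadic ranges of length $2^y$, $y = 0, \dots, \log_2 n - 1$, a sketch is kept: $\log_2 n$ CM sketches in total.

To answer a query: compute the dyadic ranges (at most $2\log_2 n$) which canonically cover the range, pose that many point queries to the sketches, and return the sum of the answers.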
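The canonical cover can be computed greedily from the left endpoint. The following Python sketch shows only the decomposition step; the function name `dyadic_cover` and the `(x, y)` return format are assumptions for illustration. Each returned pair `(x, y)` stands for the dyadic range `[x*2^y + 1, (x+1)*2^y]` and would be answered by a point query for `x` on the sketch kept for level `y`.

```python
def dyadic_cover(l, r):
    """Split [l, r] (1-indexed, inclusive) into disjoint dyadic ranges.

    Returns a list of (x, y) pairs, left to right, where each pair denotes
    the range [x * 2**y + 1, (x + 1) * 2**y].
    """
    cover = []
    lo, hi = l - 1, r  # work on the 0-indexed half-open interval [lo, hi)
    while lo < hi:
        y = 0
        # Grow the block while it stays aligned at lo and fits before hi.
        while lo % (1 << (y + 1)) == 0 and lo + (1 << (y + 1)) <= hi:
            y += 1
        cover.append((lo >> y, y))
        lo += 1 << y
    return cover
```

For instance, `dyadic_cover(3, 8)` yields `[(1, 1), (1, 2)]`, that is, the ranges `[3, 4]` and `[5, 8]`.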

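Theorem 4: $a[l,r] \le \hat{a}[l,r]$, and with probability at least $1 - \delta$, $\hat{a}[l,r] \le a[l,r] + 2\varepsilon \log n\, \|a\|_1$.
Proof: By Theorem 1 and linearity of expectation, $E(\sum \text{error for each estimator}) = \sum E(\text{error for each estimator}) \le 2\log n \cdot (\varepsilon/e)\|a\|_1$; the Markov inequality then applies as in Theorem 1. ■

Time to produce the estimate: $O(\log(n)\log(1/\delta))$
Time for updates: $O(\log(n)\log(1/\delta))$
Space used: $O((\log(n)/\varepsilon)\log(1/\delta))$
Remark: The guarantee will be more useful when stated without terms of $\log n$ in the approximation bound.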

Applications of Count-Min Sketches
- Quantiles
- Heavy Hitters

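Quantiles in the Turnstile Model
- $\phi$-quantiles: items with rank $k\phi\|a\|_1$ for $k = 0, \dots, 1/\phi$ (approximate: rank between $(k\phi - \varepsilon)\|a\|_1$ and $(k\phi + \varepsilon)\|a\|_1$).
- Do binary searches for ranges $[1, r]$ whose range sum $a[1,r]$ is $k\phi\|a\|_1$.

Theorem 5: $\varepsilon$-approximate $\phi$-quantiles can be found with probability at least $1 - \delta$ by keeping a data structure with space $O((1/\varepsilon)\log^2(n)\log(\log n/\delta))$. The time for each insert or delete operation is $O(\log(n)\log(\log n/\delta))$, and the time to find each quantile on demand is $O(\log(n)\log(\log n/\delta))$.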

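Heavy Hitters (cash register case)
- $\phi$-heavy hitters: items whose multiplicity exceeds the fraction $\phi$ of the total, $a_i \ge \phi\|a\|_1$ (approximate: items with $a_i \ge (\phi - \varepsilon)\|a\|_1$ may also be accepted, for some $\varepsilon < \phi$).
- When an update for item $i_t$ arrives, the sketch is updated and the point query $Q(i_t)$ is posed; if the estimate is above the threshold $\phi\|a(t)\|_1$, $i_t$ is added to a heap.
- The heap is kept small by deleting its lowest-count item whenever that item's estimate falls below the threshold. At the end, all heap items whose estimated count is still above $\phi\|a\|_1$ are output.

Theorem 6: The heavy hitters can be found from an inserts-only sequence of length $\|a\|_1$ by using CM sketches with space $O((1/\varepsilon)\log(\|a\|_1/\delta))$ and time $O(\log(\|a\|_1/\delta))$ per item. Every item which occurs with count more than $\phi\|a\|_1$ is output, and with probability $1 - \delta$, no item whose count is less than $(\phi - \varepsilon)\|a\|_1$ is output.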
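The heap-based procedure for the inserts-only case might look as follows in Python. This is a simplified sketch, not the authors' implementation: the function name, the use of a dictionary next to the heap to detect stale heap entries, and the hash family `((a*x + b) mod p) mod w` are assumptions, and the width `w` and depth `d` are passed in directly rather than derived from $\varepsilon$, $\delta$ and $\|a\|_1$ as in Theorem 6.

```python
import heapq
import random

P = (1 << 61) - 1  # prime modulus for the illustrative hash family


def heavy_hitters(stream, phi, w, d, seed=0):
    """Return candidate phi-heavy hitters of an inserts-only stream of ints."""
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    counts = [[0] * w for _ in range(d)]
    heap = []  # min-heap of (estimate, item); may contain stale entries
    best = {}  # item -> estimate recorded when it last passed the threshold
    total = 0  # running ||a||_1
    for i in stream:
        total += 1
        cells = [((a * i + b) % P) % w for a, b in hashes]
        for j, k in enumerate(cells):
            counts[j][k] += 1
        est = min(counts[j][k] for j, k in enumerate(cells))
        if est >= phi * total:
            best[i] = est
            heapq.heappush(heap, (est, i))
        # Drop heap tops that are stale or have fallen below the threshold.
        while heap and (
            best.get(heap[0][1]) != heap[0][0] or heap[0][0] < phi * total
        ):
            est0, item = heapq.heappop(heap)
            if best.get(item) == est0:
                del best[item]
    return {i for i, est in best.items() if est >= phi * total}
```

Because estimates never undercount in the inserts-only case, an item whose true count exceeds `phi * total` at the end passes the threshold at its last occurrence and is never evicted afterwards; false positives are possible only through hash collisions.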

Sketching techniques
- Tug-of-war: Alon, Matias and Szegedy (1996)
- Count sketch: Charikar, Chen and Farach-Colton (2002)
- Random subset sums: Gilbert, Kotidis, Muthukrishnan and Strauss (2002)
- Count-min sketch: Cormode and Muthukrishnan (2003)

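- All of these are linear projections of the vector $a$ with appropriately chosen random vectors.
Computation:
- Sketch: an array of dimension $d \times w$
- $h_1, \dots, h_d$: pairwise independent hash functions $\{1, \dots, n\} \to \{1, \dots, w\}$
- $g_1, \dots, g_d$: hash functions whose range and randomness vary by method
- The $k$-th entry of row $j$ of the sketch: $\sum_{i : h_j(i) = k} a_i\, g_j(i)$

- Tug-of-war: $g_j(i)$ is $\{+1, -1\}$ with 4-wise independence
- Count sketch: $g_j(i)$ is $\{+1, -1\}$ with 2-wise independence
- Random subset sums: $g_j(i)$ is $\{0, 1\}$ with 2-wise independence
- Count-min sketch: $g_j(i)$ is $1$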

Method | Query | Space | Update Time | Query Time | Randomness Needed
Tug-of-war | Inner-product | | | | 4-wise
Tug-of-war | Point, Range | | | | 4-wise
Random subset-sums | Range | | | | Pairwise
Count sketches | Point | | | | Pairwise
Count-Min sketches | Point, Inner-product, Range | | | | Pairwise