Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report by MH, 2004/12/17.



Finding Frequent Items in Data Streams
Today: Synopsis Data Structures; Sketches and Frequency Moments; Finding Frequent Items in Data Streams

Synopsis Data Structures
A "lossy" summary of a data stream.
- Advantages: fits in memory and is easy to communicate
- Disadvantage: lossiness implies approximation error
- Key techniques: randomization and hashing

Random Samples
Goal: maintain a uniform sample of the item stream. Sampling semantics?
- Coin flip: select each item with probability p; easy to maintain, but undesirable since the sample size is unbounded
- Fixed-size sample without replacement: our focus today
- Fixed-size sample with replacement: can be generated from the previous sample
- Non-uniform samples [Chaudhuri-Motwani-Narasayya]
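The fixed-size sample without replacement is usually maintained with reservoir sampling; a minimal Python sketch (the function name and interface are illustrative, not from the paper):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform fixed-size sample of k items, without replacement."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)   # fill the reservoir first
        else:
            j = rng.randrange(n)  # keep the new item with probability k/n
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(100), 5))
```

Every item ends up in the sample with probability exactly k/n, yet only k items are ever stored.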

Generalized Stream Model
Input element (i, a): a copies of domain value i, i.e., increment the i-th component of the frequency vector m by a (a need not be an integer).
Data stream: 2, 0, 1, 3, 1, 2, 4, ...
[figure: frequency vector (m_0, m_1, m_2, m_3, ...)]

Example
Starting from frequency vector (m_0, m_1, m_2, m_3, ...):
- On seeing element (i, a) = (2, 2): m_2 increases by 2
- On seeing element (i, a) = (1, -1): m_1 decreases by 1

Frequency Moments
Input stream: values from U = {0, 1, ..., N-1}, with frequency vector m = (m_0, m_1, ..., m_{N-1}).
k-th frequency moment: F_k(m) = Σ_i m_i^k
- F_0: number of distinct values
- F_1: stream size
- F_2: Gini index, self-join size, squared Euclidean norm
- F_k for k > 2: measures skew, sometimes useful
- F_∞: maximum frequency
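When memory allows, the moments can of course be computed exactly; a small Python sketch (names are illustrative) that builds the frequency vector from (i, a) updates and evaluates F_k:

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k(m) = sum_i m_i^k, where m is the frequency vector of the stream."""
    m = Counter()
    for i, a in stream:       # element (i, a): a copies of domain value i
        m[i] += a
    return sum(c ** k for c in m.values())

stream = [(2, 1), (0, 1), (1, 2), (2, 1)]  # frequencies: m_0=1, m_1=2, m_2=2
print(frequency_moment(stream, 0))  # F_0 = 3 distinct values
print(frequency_moment(stream, 1))  # F_1 = 5, the stream size
print(frequency_moment(stream, 2))  # F_2 = 1 + 4 + 4 = 9
```

Streaming algorithms approximate these quantities without ever storing m itself.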

Finding Frequent Items in Data Streams
Outline: Introduction; Main Idea; the COUNT SKETCH algorithm; Final Results

The Google Problem (this work was done while the authors were at Google Inc.)
Return a list of the k most frequent items in the stream.
Motivation: search-engine queries, network traffic, ...
Remember: we saw a lower bound recently!
Solution: a data structure, Count-Sketch, maintaining count estimates of high-frequency elements.

Introduction (1)
One of the most basic problems on a data stream [HRR98, AMS99] is finding the most frequently occurring items in the stream. We assume here that the stream is large enough that memory-intensive solutions, such as sorting the stream or keeping a counter for each distinct element, are infeasible. The problem comes up in the context of search engines, where the streams in question are streams of queries sent to the engine and we want the most frequent queries handled in some period of time.

Introduction (2)
A wide variety of heuristics for this problem have been proposed, all involving some combination of sampling, hashing, and counting (see [GM99] and Section 2 of the paper for a survey). However, none of these solutions has clean bounds on the amount of space necessary to produce good approximate lists of the most frequent items.

Definitions
Notation: assume {1, 2, ..., N} is indexed in order of frequency; m_i is the frequency of the i-th most frequent element, and m = Σ_i m_i is the number of elements in the stream.
Two notions of approximating the frequent-elements problem:
- FindCandidateTop. Input: stream S, int k, int p. Output: a list of p elements containing the top k.
- FindApproxTop. Input: stream S, int k, real ε. Output: a list of k elements, each of frequency m_i > (1-ε) m_k.

FindCandidateTop can be hard: suppose, for example, that m_k = m_{p+1} + 1, i.e., the k-th most frequent element has almost the same frequency as the (p+1)-st most frequent element. Then it is almost impossible to output only p elements that are likely to contain the top k. We therefore define the relaxed variant, FindApproxTop.

Main Idea
Consider a single counter X and a hash function h(i): {1, 2, ..., N} → {-1, +1}. On input element i, update the counter X += Z_i, where Z_i = h(i). For each r, use X·Z_r as an estimator of m_r.
Theorem: E[X·Z_r] = m_r
Proof: X = Σ_i m_i Z_i, so E[X·Z_r] = E[Σ_i m_i Z_i Z_r] = Σ_i m_i E[Z_i Z_r] = m_r E[Z_r^2] = m_r, since E[Z_i Z_r] = 0 for i ≠ r and Z_r^2 = 1.
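The unbiasedness claim is easy to check empirically. A Python sketch that averages the single-counter estimator X·Z_r over many independent random sign assignments (a fully random h stands in for the pairwise-independent hash of the analysis):

```python
import random

def single_counter_estimates(stream, trials=2000, seed=0):
    """Average X * Z_r over many random sign hashes; should approach m_r."""
    rng = random.Random(seed)
    domain = sorted(set(stream))
    totals = {r: 0.0 for r in domain}
    for _ in range(trials):
        z = {i: rng.choice((-1, 1)) for i in domain}  # h(i) in {-1, +1}
        x = sum(z[i] for i in stream)                 # X += Z_i per element
        for r in domain:
            totals[r] += x * z[r]                     # estimator of m_r
    return {r: totals[r] / trials for r in domain}

print(single_counter_estimates([1, 1, 1, 2, 2, 3]))  # near {1: 3, 2: 2, 3: 1}
```

The averages converge to the true frequencies, but any single trial has the large variance discussed next.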

A couple of problems
- The variance of every estimate is very large.
- O(N) elements have estimates that are wrong by more than the variance.

Array of Counters
Idea: keep t counters c_1, ..., c_t with t hash functions h_1, ..., h_t. We can then take the mean or median of the t estimates to achieve an estimate with lower variance.

Problem with "Array of Counters"
- Variance is dominated by the highest frequencies: estimates for less-frequent elements, like the k-th, are corrupted by collisions with high-frequency elements.
- Avoiding collisions: spread out the high-frequency elements by replacing each counter with a hashtable of b counters.

Count-Sketch data structure
Hash functions: independent hashes h_1, ..., h_t and s_1, ..., s_t, with the two families independent of each other.
Data structure: t hashtables of b counters, X(r, c) for r = 1, ..., t and c = 1, ..., b.
- s_r: i → {1, ..., b} picks a bucket
- h_r: i → {+1, -1} picks a sign

Configuration and operations
- s_r(i) selects one of the b counters in the r-th hashtable
- ADD(i): for each r, update X(r, s_r(i)) += h_r(i)
- Estimate(m_i) = median_r { X(r, s_r(i)) · h_r(i) }
- Maintain a heap of the top k elements seen so far
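Putting the pieces together, a compact Python sketch of the data structure. For brevity, fully random memoized hash tables stand in for the pairwise-independent hash families the analysis assumes; a real implementation would use hash functions precisely to avoid storing such tables.

```python
import random
from statistics import median

class CountSketch:
    """t hashtables of b counters, with bucket hashes s_r and sign hashes h_r."""

    def __init__(self, t, b, seed=0):
        self.t, self.b = t, b
        self.rng = random.Random(seed)
        self.table = [[0] * b for _ in range(t)]
        self.s = [{} for _ in range(t)]  # s_r: i -> {0, ..., b-1}
        self.h = [{} for _ in range(t)]  # h_r: i -> {-1, +1}

    def _hashes(self, i, r):
        if i not in self.s[r]:
            self.s[r][i] = self.rng.randrange(self.b)
            self.h[r][i] = self.rng.choice((-1, 1))
        return self.s[r][i], self.h[r][i]

    def add(self, i):
        for r in range(self.t):
            bucket, sign = self._hashes(i, r)
            self.table[r][bucket] += sign        # X(r, s_r(i)) += h_r(i)

    def estimate(self, i):
        vals = []
        for r in range(self.t):
            bucket, sign = self._hashes(i, r)
            vals.append(self.table[r][bucket] * sign)
        return median(vals)                      # median_r X(r, s_r(i)) h_r(i)

cs = CountSketch(t=5, b=64)
for x in [1] * 50 + [2] * 30 + list(range(3, 20)):
    cs.add(x)
print(cs.estimate(1), cs.estimate(2))  # close to 50 and 30
```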

Why we choose the median
We have not eliminated the problem of collisions with high-frequency elements, and these will still spoil some subset of the estimates. The mean is very sensitive to such outliers, while the median is sufficiently robust.
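A tiny numeric illustration of the point (the numbers are made up): one of five per-row estimates of a frequency m_i = 30 was spoiled by a collision with a high-frequency element.

```python
from statistics import mean, median

estimates = [29, 31, 30, 30, 212]  # the last row suffered a collision
print(mean(estimates))             # dragged far from 30 by the outlier
print(median(estimates))           # still 30
```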

Overall Algorithm
1. ADD(i).
2. If i is in the heap, increment its count. Otherwise, add i to the heap if Estimate(m_i) is greater than the smallest estimated count in the heap; in that case, the element with the smallest estimated count is evicted.
This algorithm solves FindApproxTop, where the choice of b depends on ε. Since the counters support both additions and subtractions, deletions are handled as well. The algorithm takes space O(tb + k); it remains to bound t and b.
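The heap-maintenance loop can be sketched independently of the sketch itself; in this illustrative Python version, exact counts stand in for the Count-Sketch estimates, so only the top-k bookkeeping, not the approximation, is shown.

```python
from collections import Counter

def approx_top_k(stream, k):
    """Track the k elements with the largest (estimated) counts."""
    counts = Counter()
    tracked = {}                      # element -> estimated count (the "heap")
    for i in stream:
        counts[i] += 1                # stands in for ADD(i) on the sketch
        est = counts[i]               # stands in for Estimate(m_i)
        if i in tracked:
            tracked[i] = est          # already tracked: update its count
        elif len(tracked) < k:
            tracked[i] = est
        else:
            victim = min(tracked, key=tracked.get)
            if est > tracked[victim]: # evict the smallest estimated count
                del tracked[victim]
                tracked[i] = est
    return sorted(tracked, key=tracked.get, reverse=True)

print(approx_top_k([1] * 5 + [2] * 3 + [3] * 2 + [4], k=2))  # [1, 2]
```

A production version would replace the linear scan for the minimum with a real min-heap, giving logarithmic per-item heap cost.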

Final Results (1): bounding t and b
- t = O(log(m/δ)), where δ bounds the probability that the algorithm fails
- b = O(k + (Σ_{i>k} m_i^2) / (ε m_k)^2)
(5 lemmas and 1 theorem, listed at the end, establish these bounds.) So...

Final Results (2)
- FindApproxTop is solved in space O([k + (Σ_{i>k} m_i^2) / (ε m_k)^2] log(m/δ)).
- Zipfian distributions: m_i ∝ 1/i^z gives improved results compared with the sampling algorithm.
- Finding items with the largest frequency change also has a practical motivation in the context of search-engine query streams, since the queries whose frequency changes most between two consecutive time periods indicate which topics people are currently most interested in [Goo].

5 Lemmas and 1 Theorem (1)
Let n_q(l) be the number of occurrences of element q up to position l, and let A_i[q] be the set of elements that hash onto the same bucket as q in the i-th row.