CS 361 Lecture 5: Approximate Quantiles and Histograms (Gurmeet Singh Manku, 9 Oct 2002)

2 Frequency-Related Problems
– Find all elements with frequency > 0.1%
– Top-k most frequent elements
– What is the frequency of element 3?
– What is the total frequency of elements between 8 and 14?
– Find elements that occupy 0.1% of the tail
– Mean and variance? Median?
– How many elements have non-zero frequency?

3 Types of Histograms
– Equi-depth histograms: select bucket boundaries so that the count in each bucket is (roughly) equal.
– V-optimal histograms: select bucket boundaries to minimize the frequency variance within buckets.
[Diagrams: count per bucket vs. domain values, for each histogram type.]
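The equi-depth idea can be sketched offline (an assumed helper, not from the slides): sort the data and cut at equally spaced ranks. The streaming algorithms later in this lecture approximate these boundaries without storing all the data.

```python
# Offline equi-depth histogram: each boundary is an exact quantile of the
# sorted data, so every bucket holds about n/num_buckets elements.
def equi_depth_histogram(values, num_buckets):
    """Return num_buckets - 1 boundaries with ~equal counts per bucket."""
    data = sorted(values)
    n = len(data)
    # Boundary i is the element at rank i*n/num_buckets.
    return [data[min(n - 1, (i * n) // num_buckets)]
            for i in range(1, num_buckets)]

print(equi_depth_histogram(range(100), 4))  # → [25, 50, 75]
```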

4 Histograms: Applications
– One-dimensional data: database query optimization [Selinger78] (selectivity estimation); parallel sorting [DNS91] [NowSort97] (Jim Gray's sorting benchmark); [PIH96] [Poo97] introduced a taxonomy, algorithms, etc.
– Multidimensional data: OLTP, not much use (attribute-independence assumption); OLAP and mining, widely used.

5 Finding the Median
– Exact median in main memory: O(n) time [BFPRT 73]
– Exact median in one pass: requires n/2 memory [Pohl 68]
– Exact median in p passes: O(n^(1/p)) memory [MP 80]; e.g., 2 passes need O(sqrt(n))
– How about an approximate median?

6 Approximate Medians and Quantiles
– φ-quantile: the element with rank φN, for 0 < φ < 1 (φ = 0.5 means the median)
– ε-approximate φ-quantile: any element with rank (φ ± ε)N, for 0 < ε < 1; typical ε = 0.01 (1%), giving an ε-approximate median
– Multiple equi-spaced ε-approximate quantiles = an equi-depth histogram
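The definition above can be made concrete with a small checker (an assumed helper, not from the slides): a value is an ε-approximate φ-quantile if some rank it occupies in the sorted data falls within (φ ± ε)N.

```python
# Check whether `value` is an epsilon-approximate phi-quantile of `data`.
def is_approx_quantile(data, value, phi, eps):
    data = sorted(data)
    n = len(data)
    # A value may occupy a range of ranks if there are duplicates:
    lo = sum(1 for x in data if x < value) + 1   # smallest possible rank
    hi = sum(1 for x in data if x <= value)      # largest possible rank
    return lo <= (phi + eps) * n and hi >= (phi - eps) * n

data = list(range(1, 101))                        # rank of x is x here
print(is_approx_quantile(data, 50, 0.5, 0.01))    # exact median → True
print(is_approx_quantile(data, 60, 0.5, 0.01))    # rank 60 ∉ (49, 51) → False
```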

7 Plan for Today
– Greenwald-Khanna algorithm for arbitrary-length streams
– Munro-Paterson algorithm for fixed N
– Sampling-based algorithms for arbitrary-length streams
– Randomized algorithm for fixed N
– Randomized algorithm for arbitrary-length streams
– Generalization

8 Data Distribution Assumptions
The input sequence of ranks is arbitrary (no distributional assumptions), e.g., warehouse data.

9 Munro-Paterson Algorithm [MP 80]
Input: N and ε. Use b buffers, each of size k; memory = bk. [Diagram: tree of buffer collapses with b = 4.]
Minimize bk subject to the following constraints:
– Number of elements covered by the leaves: k·2^b > N
– Max relative error in rank: b/2k < ε
This gives b ≈ log(εN) and k ≈ (1/ε)·log(εN), so memory = bk = O((1/ε)·log²(εN)).
How do we collapse two sorted buffers into one? Merge them, then pick alternate elements.
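The collapse step named above is simple enough to sketch directly: merge the two sorted buffers, then keep alternate elements so the result is again one buffer of size k (conceptually carrying twice the weight).

```python
import heapq

# Munro-Paterson collapse: merge two sorted size-k buffers, keep every
# other element of the merged sequence.
def collapse(buf_a, buf_b):
    merged = list(heapq.merge(buf_a, buf_b))   # linear-time merge of sorted inputs
    return merged[1::2]                        # pick alternate elements

print(collapse([1, 3, 5, 7], [2, 4, 6, 8]))  # → [2, 4, 6, 8]
```

Whether one keeps the odd or the even positions is an arbitrary choice; either way the output's ranks are within the error bound analyzed on the following slides.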

10 Error Propagation
[Diagram: top-down analysis of a collapse from depth d to depth d+1. Elements are marked S (known smaller), L (known larger), or "?" (uncertain); if a depth-d buffer contributes x "?" elements, the depth-(d+1) buffer has at most 2x+1 "?" elements.]

11 Error Propagation at Depth 0
[Diagram: collapse from depth 0 to depth 1. Depth-0 buffers are exact, so their elements are marked S (smaller), M (median), or L (larger) with no uncertainty.]

12 Error Propagation at Depth 1
[Diagram: collapse from depth 1 to depth 2; a single "?" element appears after the first collapse.]

13 Error Propagation at Depth 2
[Diagram: collapse from depth 2 to depth 3; the number of "?" elements grows slowly, illustrating the 2x+1 bound.]

14 Error Propagation (repeated from slide 10)
[Diagram: as on slide 10, x "?" elements at depth d yield at most 2x+1 "?" elements at depth d+1.]

15 Error Propagation, Level by Level
– Number of elements represented at depth d: k·2^d. [Diagram: b = 4 buffers of size k; depth d = 2 highlighted; memory = bk.]
– Let X be the number of "?" elements at depth d; their fraction is f = X / (k·2^d).
– The number of "?" elements at depth d+1 is at most 2X + 2^d, so their fraction is f' <= (2X + 2^d) / (k·2^(d+1)) = f + 1/2k.
– Thus the fractional error in rank grows by at most 1/2k per level; it is 0 at depth 0, and the max depth is b, so the total fractional error is <= b/2k.
– Constraint 2: b/2k < ε.
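The level-by-level bound telescopes; written out with the slide's symbols (f_d is the fraction of "?" elements at depth d):

```latex
f_{d+1} \;\le\; \frac{2X + 2^{d}}{k\,2^{d+1}}
        \;=\; \frac{X}{k\,2^{d}} + \frac{1}{2k}
        \;=\; f_{d} + \frac{1}{2k},
\qquad f_{0} = 0
\;\Longrightarrow\;
f_{b} \;\le\; \frac{b}{2k} \;<\; \varepsilon .
```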

16 Generalized Munro-Paterson [MRL 98]
Each buffer now has a 'weight' associated with it. How do we collapse buffers with different weights?

17 Generalized Collapse
[Diagram: collapsing buffers of weights 2, 3, and 1 into one buffer of weight 6 (= 2 + 3 + 1); each input element counts according to its buffer's weight when choosing the k output elements.]
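One plausible reading of the generalized collapse (a sketch, not the exact [MRL 98] procedure): conceptually replicate each element by its buffer's weight, merge, and keep k evenly spaced elements; the output buffer's weight is the sum of the input weights.

```python
# Generalized collapse sketch: buffers is a list of (sorted_list, weight).
# Returns (new_buffer_of_size_k, new_weight). Real implementations avoid
# the explicit replication; it is used here only for clarity.
def generalized_collapse(buffers, k):
    total_weight = sum(w for _, w in buffers)
    merged = sorted(x for buf, w in buffers for x in buf for _ in range(w))
    step = len(merged) // k
    # Take the middle element of each of k equal-length segments.
    new_buf = [merged[min(len(merged) - 1, i * step + step // 2)]
               for i in range(k)]
    return new_buf, total_weight

# With equal weights this reduces to the plain Munro-Paterson collapse:
print(generalized_collapse([([1, 2, 3, 4], 1), ([5, 6, 7, 8], 1)], 4))  # → ([2, 4, 6, 8], 2)
```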

18 Analysis of Generalized Munro-Paterson
Both the original and the generalized algorithm use O((1/ε)·log²(εN)) memory; the generalized version has the same asymptotic bound but a smaller constant.

19 Reservoir Sampling [Vitter 85]
Maintain a uniform sample of size s over an input sequence of length N; the approximate median is the median of the sample. If s = O((1/ε²)·log(1/δ)), then with probability at least 1-δ the answer is an ε-approximate median.
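A minimal sketch of the reservoir step (Vitter's Algorithm R): each stream element ends up in the sample with probability s/N, without knowing N in advance.

```python
import random

# Algorithm R: uniform random sample of size s from a stream of unknown length.
def reservoir_sample(stream, s, rng=random):
    reservoir = []
    for i, x in enumerate(stream):      # i is 0-based
        if i < s:
            reservoir.append(x)         # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # uniform in [0, i]
            if j < s:
                reservoir[j] = x        # element i replaces a sample w.p. s/(i+1)
    return reservoir

sample = reservoir_sample(range(10_000), s=100)
approx_median = sorted(sample)[len(sample) // 2]
```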

20 "Non-Reservoir" Sampling
Choose 1 element out of every N/s successive elements (requires knowing N). At the end of the stream the sample size is s; the approximate median is the median of the sample. If s = O((1/ε²)·log(1/δ)), then with probability at least 1-δ the answer is an ε-approximate median.
[Diagram: the stream divided into blocks of N/s elements, one element chosen per block.]

21 Non-uniform Sampling
Sample at a rate that halves as the stream grows: s out of the first s elements (weight 1), s out of the next 2s (weight 2), s out of the next 4s (weight 4), s out of the next 8s (weight 8), and so on. At the end of the stream the sample size is O(s·log(N/s)); the approximate median is the weighted median of the sample. If s = O((1/ε²)·log(1/δ)), then with probability at least 1-δ the answer is an ε-approximate median.
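The weighted median used above can be sketched as follows (an assumed helper, not from the slides): each sampled element carries the weight of the block it was drawn from, and the weighted median is the first value at which the cumulative weight reaches half the total.

```python
# Weighted median of a list of (value, weight) pairs.
def weighted_median(samples):
    samples = sorted(samples)           # sort by value
    total = sum(w for _, w in samples)
    running = 0
    for value, w in samples:
        running += w
        if running * 2 >= total:        # first value covering half the weight
            return value

print(weighted_median([(1, 1), (2, 1), (10, 4)]))  # → 10
```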

22 Sampling + Generalized Munro-Paterson [MRL 98]
– Stream of unknown length (given ε and δ): reservoir sampling maintains s samples; compute the exact median of the samples.
– Stream of known length N (given ε and δ): "1-in-N/s" sampling chooses s samples; Generalized Munro-Paterson computes an approximate median of the samples. Requires advance knowledge of N.
In both cases s = O((1/ε²)·log(1/δ)), and the output is an ε-approximate median with probability at least 1-δ.

23 Unknown-N Algorithm [MRL 99]
Stream of unknown length (given ε and δ): non-uniform sampling feeds a modified deterministic algorithm for approximate medians. The output is an ε-approximate median with probability at least 1-δ.

24 Non-uniform Sampling (recap)
Same scheme as before: s out of the first s elements (weight 1), s out of the next 2s (weight 2), s out of the next 4s (weight 4), s out of the next 8s (weight 8), and so on. At the end of the stream the sample size is O(s·log(N/s)), and the approximate median is the weighted median of the sample.

25 Modified Deterministic Algorithm
[Diagram: the non-uniform sample feeds a collapse tree with h = height of the tree and L = highest level. The bottom level holds 2s elements with weight W = 1; levels h, h+1, h+2, h+3, ... hold s elements each with weights W = 2, 4, 8, ..., up to 2^(L-h). An approximate median of the weighted samples is computed using b buffers, each of size k.]

26 Modified Munro-Paterson Algorithm
[Diagram: the same structure with b = height of the tree and H = highest level; levels b, b+1, b+2, b+3, ..., H hold 2s elements of weight 1, then s elements of weights 2, 4, 8, ..., 2^(H-b). Again b buffers of size k compute an approximate median of the weighted samples.]

27 Error Analysis
[Diagram: levels b through b+h of the weighted-sample tree, where b = height of the small tree and b+h = total height; b buffers, each of size k.] The fractional error in rank increases by at most 1/2k per level, so the total fractional error is <= (b+h)/2k.

28 Error Analysis (contd.)
Minimize bk subject to the following constraints:
– Number of elements in leaves: k·2^b > s, where s = O((1/ε²)·log(1/δ))
– Max fractional error in rank: b/2k must stay below the share of ε left over after the sampling error
This gives b = O(log(εs)) and k = O((1/ε)·log(εs)), so memory = bk = O((1/ε)·log²(εs)): almost the same as before.

29 Summary of Algorithms
– Reservoir Sampling [Vitter 85]: probabilistic
– Munro-Paterson [MP 80]: deterministic (requires advance knowledge of N)
– Generalized Munro-Paterson [MRL 98]: deterministic (requires advance knowledge of N)
– Sampling + Generalized MP [MRL 98]: probabilistic (requires advance knowledge of N)
– Non-uniform Sampling + GMP [MRL 99]: probabilistic
– Greenwald & Khanna [GK 01]: deterministic

30 V-OPT Histograms
Given: a vector V = (v_1, v_2, ..., v_N) of frequency counts.
Goal: partition V into k contiguous buckets {(s_1 = 1, e_1), (s_2 = e_1 + 1, e_2), ..., (s_i = e_{i-1} + 1, e_i), ..., (s_k, e_k = N)} such that Err = Σ_i err_i is minimized, where
– err_i (error of the i-th bucket) = Σ_{j = s_i..e_i} (v_j - μ_i)²
– μ_i = (Σ_{j = s_i..e_i} v_j) / (e_i - s_i + 1)
This minimizes the total within-bucket variance, which is good for point queries (represent each bucket by its mean).
Observe: err_i = Σ_{j = s_i..e_i} v_j² - (e_i - s_i + 1)·μ_i².

31 Dynamic Programming
DP table: T(i, j) = error of the optimal partition of V[1..i] into j buckets, with i ranging over 1..N and j over 1..k.
– Recurrence: T(i, j+1) = min_{m < i} [ T(m, j) + err(m+1, i) ], where err(m+1, i) is the error of a bucket with s = m+1 and e = i; check all m < i.
– Base case: T(i, 1) is just the variance of the first i values.
This gives an O(N²k)-time algorithm using O(Nk) space, provided err(s, e) can be computed in O(1) time for given indices s, e.

32 Dynamic Programming (contd.)
Define the prefix-sum vector S(j) = Σ_{i = 1..j} v_i and the prefix sum-of-squares vector SS(j) = Σ_{i = 1..j} v_i², for j = 1..N. With these, err(s, e) = (SS(e) - SS(s-1)) - (S(e) - S(s-1))² / (e - s + 1), computable in O(1) time.
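The DP of slides 31-32 fits in a short sketch; names (v, k, T, err, S, SS) follow the slides, while the code itself is an assumed implementation, not from the lecture.

```python
# V-optimal histogram DP: O(N^2 k) time, prefix sums give O(1) bucket error.
def v_optimal(v, k):
    """Minimum total within-bucket error for k contiguous buckets over v."""
    n = len(v)
    S = [0.0] * (n + 1)    # S[j]  = v_1 + ... + v_j      (1-based)
    SS = [0.0] * (n + 1)   # SS[j] = v_1^2 + ... + v_j^2
    for i, x in enumerate(v, 1):
        S[i] = S[i - 1] + x
        SS[i] = SS[i - 1] + x * x

    def err(s, e):  # error of a bucket covering v[s..e], 1-based inclusive
        total = S[e] - S[s - 1]
        return (SS[e] - SS[s - 1]) - total * total / (e - s + 1)

    INF = float("inf")
    # T[j][i] = optimal error partitioning v[1..i] into j buckets.
    T = [[INF] * (n + 1) for _ in range(k + 1)]
    for i in range(1, n + 1):
        T[1][i] = err(1, i)                       # base case: one bucket
    for j in range(2, k + 1):
        for i in range(j, n + 1):
            T[j][i] = min(T[j - 1][m] + err(m + 1, i)
                          for m in range(j - 1, i))
    return T[k][n]

print(v_optimal([1, 1, 1, 9, 9, 9], 2))  # splitting between the two runs → 0.0
```

Recovering the bucket boundaries (not just the error) only requires remembering the minimizing m at each (i, j).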

33 List of Papers
[Hoeffding63] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, 58:13-30, 1963.
[MP80] J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, 12, 1980.
[Vit85] J. S. Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
[MRL98] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Approximate Medians and Other Quantiles in One Pass and with Limited Memory", ACM SIGMOD 1998.
[MRL99] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets", ACM SIGMOD 1999.
[GK01] M. Greenwald and S. Khanna, "Space-Efficient Online Computation of Quantile Summaries", ACM SIGMOD 2001, pp. 58-66.