CS 361 Lecture 5: Approximate Quantiles and Histograms. 9 Oct 2002. Gurmeet Singh Manku.




Slide 1: CS 361 Lecture 5, Approximate Quantiles and Histograms. 9 Oct 2002. Gurmeet Singh Manku (manku@cs.stanford.edu).

Slide 2: Frequency-Related Problems. [Figure: a frequency histogram over domain values 1 to 20.] Queries one might ask of such a distribution:
– Find all elements with frequency > 0.1%.
– Top-k most frequent elements.
– What is the frequency of element 3?
– What is the total frequency of elements between 8 and 14?
– Find elements that occupy 0.1% of the tail.
– Mean and variance? Median?
– How many elements have non-zero frequency?

Slide 3: Types of Histograms. [Figure: two bucketings of counts over domain values 1 to 20, with axes "Count for bucket" vs. "Domain values".]
– Equi-Depth Histograms. Idea: select buckets such that counts per bucket are equal.
– V-Optimal Histograms. Idea: select buckets to minimize frequency variance within buckets.
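To make the equi-depth idea concrete, here is a minimal Python sketch (ours, not from the lecture; all names are illustrative) that picks bucket boundaries from sorted data so that each bucket holds roughly the same count:

```python
# Sketch: equi-depth bucket boundaries from sorted data.
# Function and variable names are illustrative, not from the slides.

def equi_depth_boundaries(sorted_data, num_buckets):
    """Return boundary values so each bucket holds ~n/num_buckets items."""
    n = len(sorted_data)
    return [sorted_data[(i * n) // num_buckets] for i in range(1, num_buckets)]

# Example: 20 domain values, 4 buckets -> 3 interior boundaries.
data = sorted([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4])
print(equi_depth_boundaries(data, 4))
```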

Slide 4: Histograms: Applications.
One-dimensional data:
– Database query optimization [Selinger78]: selectivity estimation.
– Parallel sorting [DNS91] [NowSort97]: Jim Gray's sorting benchmark.
– [PIH96] [Poo97] introduced a taxonomy, algorithms, etc.
Multidimensional data:
– OLTP: not much use (independent-attribute assumption).
– OLAP and mining: yes, histograms are useful here.

Slide 5: Finding the Median.
– Exact median in main memory: O(n) time [BFPRT 73].
– Exact median in one pass: requires ~n/2 space [Pohl 68].
– Exact median in p passes: O(n^(1/p)) space [MP 80]; e.g., 2 passes need O(sqrt(n)).
How about an approximate median?

Slide 6: Approximate Medians and Quantiles.
– φ-quantile: the element with rank ⌈φN⌉, for 0 < φ < 1 (φ = 0.5 means the median).
– ε-approximate φ-quantile: any element with rank in [(φ − ε)N, (φ + ε)N]. Typical ε = 0.01 (1%), giving an ε-approximate median.
– Multiple equi-spaced ε-approximate quantiles = an equi-depth histogram.
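A small sketch of this definition (ours; names illustrative, ranks 1-based): testing whether an element qualifies as an ε-approximate φ-quantile of a sorted dataset.

```python
# Illustrative sketch (not from the lecture): test whether x qualifies as
# an eps-approximate phi-quantile of sorted_data, i.e., whether some
# occurrence of x has 1-based rank within (phi +/- eps) * N.

import bisect
import math

def is_approx_quantile(sorted_data, x, phi, eps):
    n = len(sorted_data)
    lo = math.ceil((phi - eps) * n)                  # smallest acceptable rank
    hi = math.ceil((phi + eps) * n)                  # largest acceptable rank
    r_min = bisect.bisect_left(sorted_data, x) + 1   # first rank of x
    r_max = bisect.bisect_right(sorted_data, x)      # last rank of x
    if r_max < r_min:
        return False                                 # x not present at all
    return r_min <= hi and r_max >= lo

data = list(range(1, 101))                           # element i has rank i
print(is_approx_quantile(data, 50, 0.5, 0.01))       # True: rank 50 is in [49, 51]
print(is_approx_quantile(data, 55, 0.5, 0.01))       # False: rank 55 is not
```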

Slide 7: Plan for Today.
– Greenwald-Khanna algorithm for arbitrary-length streams.
– Munro-Paterson algorithm for fixed N.
– Sampling-based algorithms for arbitrary-length streams.
– Randomized algorithm for fixed N.
– Randomized algorithm for arbitrary-length streams.
– Generalization.

Slide 8: Data Distribution Assumptions. The input sequence of ranks is arbitrary, e.g., warehouse data.

Slide 9: Munro-Paterson Algorithm [MP 80]. Input: N and ε. [Figure: the collapse tree for b = 4 levels, with buffers repeatedly filled and pairwise collapsed.] Use b buffers, each of size k; memory = bk. Minimize bk subject to the following constraints:
– Number of elements covered by the leaves: k · 2^b > N.
– Max relative error in rank: b/2k < ε.
These give b ≈ log(εN) and k ≈ (1/ε) log(εN), so memory = bk = O((1/ε) log²(εN)).
How do we collapse two sorted buffers into one? Merge, then pick alternate elements.
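A minimal Python sketch of this collapse step (names are ours; whether to keep odd or even positions of the merged sequence is an arbitrary choice here):

```python
# Sketch of the Munro-Paterson COLLAPSE step: merge two sorted buffers of
# size k, then keep alternate elements to get one size-k buffer whose
# weight is doubled.

import heapq

def collapse(buf_a, buf_b):
    """Merge two sorted, equal-size buffers and keep every other element."""
    merged = list(heapq.merge(buf_a, buf_b))   # 2k elements, sorted
    return merged[1::2]                        # alternate elements -> k elements

print(collapse([1, 4, 7, 9], [2, 3, 8, 10]))  # -> [2, 4, 8, 10]
```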

Slide 10: Error Propagation (top-down analysis). [Figure: buffers at depth d and depth d+1, elements marked S (known smaller than the output), L (known larger), and "?" (rank uncertain).] Key fact: if a buffer at depth d contains x "?" elements, the buffers at depth d+1 that were collapsed into it contain at most 2x + 1 "?" elements.

Slide 11: Error Propagation at Depth 0. [Figure: at depth 0 the chosen element M splits its buffer exactly into S and L elements, with no "?"; the depth-1 buffers show where the first uncertainty appears.]

Slide 12: Error Propagation at Depth 1. [Figure: the "?" count grows by the 2x + 1 rule when moving from depth 1 to depth 2.]

Slide 13: Error Propagation at Depth 2. [Figure: likewise from depth 2 to depth 3; a single "?" element at depth 2 becomes at most three at depth 3.]

Slide 14: Error Propagation (repeat of Slide 10). [Figure: x "?" elements at depth d yield at most 2x + 1 "?" elements at depth d+1.]

Slide 15: Error Propagation, Level by Level. [Figure: the b = 4 collapse tree again, annotated with depths.] The number of elements at depth d is k · 2^d, and the increase in fractional rank error is 1/2k per level:
– Let the number of "?" elements at depth d be X; the fraction of "?" elements at depth d is f = X / (k · 2^d).
– The number of "?" elements at depth d+1 is at most 2X + 2^d (each of the 2^d collapses at that level adds at most one).
– So the fraction of "?" elements at depth d+1 is f' ≤ (2X + 2^d) / (k · 2^(d+1)) = f + 1/2k.
The fractional error in rank at depth 0 is 0 and the max depth is b, so the total fractional error is ≤ b/2k. This is Constraint 2: b/2k < ε.
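The same argument as a compact derivation (notation as on the slide):

```latex
\begin{align*}
f  &= \frac{X}{k \cdot 2^{d}}
     && \text{fraction of ``?'' elements at depth } d \\
f' &\le \frac{2X + 2^{d}}{k \cdot 2^{d+1}} = f + \frac{1}{2k}
     && \text{fraction of ``?'' elements at depth } d+1 \\
\text{total fractional error} &\le b \cdot \frac{1}{2k} = \frac{b}{2k} < \varepsilon
     && \text{starting from } f = 0 \text{ at depth } 0 \text{, over } b \text{ levels}
\end{align*}
```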

Slide 16: Generalized Munro-Paterson [MRL 98]. Each buffer now has a 'weight' associated with it. How do we collapse buffers with different weights?

Slide 17: Generalized Collapse, a worked example with k = 5.
– Input buffers: weight 3: {8, 10, 13, 19, 35}; weight 2: {5, 6, 12, 31, 37}; weight 1: {15, 16, 25, 27, 28}.
– Conceptually replicate each element as many times as its buffer's weight and merge: 5 5 6 6 8 8 8 10 10 10 12 12 13 13 13 15 16 19 19 19 25 27 28 31 31 35 35 35 37 37 (30 weighted elements).
– Output buffer of weight 6 (= 3 + 2 + 1): pick every 6th element of the weighted sequence: {6, 10, 15, 27, 35}.
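A Python sketch of this weighted collapse, following the worked example above (names are ours; the stride offset of W/2 is the choice that reproduces the positions picked in the example):

```python
# Sketch of the generalized (weighted) collapse: conceptually replicate
# each element by its buffer's weight, merge, then pick k evenly spaced
# elements from the weighted sequence.

def weighted_collapse(buffers, k):
    """buffers: list of (weight, sorted list of k elements)."""
    weighted = sorted(x for w, buf in buffers for x in buf for _ in range(w))
    total_weight = sum(w for w, _ in buffers)     # output buffer's weight
    offset = total_weight // 2                    # e.g. 3 when W = 6
    picked = [weighted[j * total_weight + offset] for j in range(k)]
    return total_weight, picked

bufs = [(3, [8, 10, 13, 19, 35]),
        (2, [5, 6, 12, 31, 37]),
        (1, [15, 16, 25, 27, 28])]
print(weighted_collapse(bufs, 5))   # -> (6, [6, 10, 15, 27, 35])
```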

Slide 18: Analysis of Generalized Munro-Paterson. [The memory-bound formulas on this slide were images; both the original and the generalized algorithm use O((1/ε) log²(εN)) memory.] The generalized version achieves the same asymptotic bound, but with a smaller constant.

Slide 19: Reservoir Sampling [Vitter 85]. Maintain a uniform sample of size s over an input sequence of length N; the approximate median is the median of the sample. If s = O((1/ε²) log(1/δ)), then with probability at least 1 − δ, the answer is an ε-approximate median.
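For reference, a sketch of Vitter's Algorithm R, which maintains such a uniform sample over a stream of unknown length (variable names are ours):

```python
# Reservoir sampling (Algorithm R): keep a uniform random sample of
# size s over a stream, replacing entries with decreasing probability.

import random

def reservoir_sample(stream, s):
    reservoir = []
    for i, x in enumerate(stream):          # i is 0-based
        if i < s:
            reservoir.append(x)             # fill the reservoir first
        else:
            j = random.randint(0, i)        # uniform in [0, i]
            if j < s:
                reservoir[j] = x            # keep x with probability s/(i+1)
    return reservoir

sample = reservoir_sample(range(10**6), 101)
approx_median = sorted(sample)[len(sample) // 2]
print(approx_median)   # close to 500000 with high probability
```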

Slide 20: "Non-Reservoir" Sampling. [Figure: the stream split into blocks of N/s successive elements, with one element chosen per block.] Choose 1 out of every N/s successive elements; at the end of the stream, the sample size is s. The approximate median is the median of the sample. As on the previous slide, if s = O((1/ε²) log(1/δ)), then with probability at least 1 − δ the answer is an ε-approximate median.

Slide 21: Non-uniform Sampling. [Figure: the stream divided into epochs of doubling length.] Keep s out of the first s elements (weight 1), s out of the next 2s elements (weight 2), s out of the next 4s (weight 4), s out of the next 8s (weight 8), and so on. At the end of the stream, the sample size is O(s log(N/s)). The approximate median is the weighted median of the sample. For s as on the previous slides, with probability at least 1 − δ the answer is an ε-approximate median.
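A sketch of this epoch scheme (all names are ours; simple uniform sampling over each buffered segment is one straightforward way to realize "s out of the segment"):

```python
# Sketch: non-uniform sampling over doubling epochs. Epoch t covers a
# segment twice as long as epoch t-1; we keep s items from it, each
# carrying weight 2^t. Final sample size is O(s log(N/s)).

import random

def nonuniform_sample(stream, s):
    """Return a list of (weight, item) pairs."""
    sample, epoch, seg_len, buf = [], 0, s, []
    for x in stream:
        buf.append(x)
        if len(buf) == seg_len:
            kept = random.sample(buf, s)            # s out of seg_len items
            sample.extend((2 ** epoch, y) for y in kept)
            epoch += 1
            seg_len = s * (2 ** epoch)              # next segment doubles
            buf = []
    if buf:                                         # partial last segment
        kept = random.sample(buf, min(s, len(buf)))
        sample.extend((2 ** epoch, y) for y in kept)
    return sample

def weighted_median(sample):
    sample.sort(key=lambda p: p[1])                 # sort by item value
    half = sum(w for w, _ in sample) / 2
    acc = 0
    for w, x in sample:
        acc += w
        if acc >= half:
            return x

print(weighted_median(nonuniform_sample(range(100000), 64)))
```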

Slide 22: Sampling + Generalized Munro-Paterson [MRL 98]. Requires advance knowledge of N; the output is an ε-approximate median with probability at least 1 − δ. Two pipelines:
– Stream of unknown length, with ε and δ: reservoir sampling maintains the samples; compute the exact median of the samples.
– Stream of known length N, with ε and δ: "1-in-N/s" sampling chooses s samples; Generalized Munro-Paterson then computes an approximate median of the samples.
[The sample-size and memory formulas on this slide were images and are not recoverable from the transcript.]

Slide 23: Unknown-N Algorithm [MRL 99]. Stream of unknown length, with ε and δ: non-uniform sampling feeds a modified deterministic algorithm for approximate medians. The output is an ε-approximate median with probability at least 1 − δ. [The memory formula was an image; Slide 28 derives the deterministic part's memory as O((1/ε) log²(εs)).]

Slide 24: Non-uniform Sampling (recap of Slide 21). [Figure: the same doubling-epoch scheme: s elements at weight 1, then s out of 2s at weight 2, s out of 4s at weight 4, s out of 8s at weight 8; final sample size O(s log(N/s)); the approximate median is the weighted median of the sample.]

Slide 25: Modified Deterministic Algorithm. [Figure: a collapse tree of b buffers, each of size k, fed by the weighted samples: 2s elements with W = 1 entering at height h, then s elements with W = 2 at height h+1, s with W = 4 at h+2, s with W = 8 at h+3, and so on up to s elements with W = 2^(L−h) at the highest level L, where h is the height of the tree.] Compute the approximate median of the weighted samples.

Slide 26: Modified Munro-Paterson Algorithm. [Figure: the same picture with b = height of the tree and H = highest level: b buffers of size k; weighted samples of 2s elements with W = 1, then s elements each with W = 2, 4, 8, …, 2^(H−b), entering at heights b, b+1, b+2, b+3, …, H.] Compute the approximate median of the weighted samples.

Slide 27: Error Analysis. [Figure: the collapse tree with b = height of the small tree and b + h = total height; weighted samples enter at heights b, b+1, b+2, …, b+h: 2s elements with W = 1, then s elements each at doubling weights; b buffers, each of size k.] The increase in fractional rank error is 1/2k per level, so the total fractional error is at most (b + h)/2k.

Slide 28: Error Analysis, contd. Minimize bk subject to the following constraints (almost the same as before):
– Number of elements in the leaves: k · 2^b > s, where s is the sample size chosen by the sampling stage (its formula was an image in the original).
– Max fractional error in rank: b/k < (1 − β)ε, where β denotes the share of the error budget reserved for sampling (the symbols were garbled in the transcript; β is a reconstruction).
These give b = O(log(εs)) and k = O((1/ε) log(εs)), hence memory = bk = O((1/ε) log²(εs)).

Slide 29: Summary of Algorithms.
– Reservoir Sampling [Vitter 85]: probabilistic.
– Munro-Paterson [MP 80]: deterministic; requires advance knowledge of N.
– Generalized Munro-Paterson [MRL 98]: deterministic; requires advance knowledge of N.
– Sampling + Generalized MP [MRL 98]: probabilistic; requires advance knowledge of N.
– Non-uniform Sampling + GMP [MRL 99]: probabilistic.
– Greenwald & Khanna [GK 01]: deterministic.

Slide 30: V-OPT Histograms.
Given: a frequency-count vector V = {v_1, v_2, …, v_N}.
Goal: partition V into k contiguous buckets {(s_1 = 1, e_1), (s_2 = e_1 + 1, e_2), …, (s_i = e_{i−1} + 1, e_i), …, (s_k, e_k = N)} such that Err = Σ_i err_i is minimized, where err_i (the error for the ith bucket) is Σ_{j=s_i}^{e_i} (v_j − μ_i)², with μ_i = (Σ_{j=s_i}^{e_i} v_j) / (e_i − s_i + 1).
– This minimizes the sum of intra-bucket variances; good for point queries (represent each bucket by its mean).
Observe: err_i = Σ_{j=s_i}^{e_i} v_j² − (e_i − s_i + 1) μ_i².

Slide 31: Dynamic Programming.
Table: T(i, j) = error of the optimal partition of V[1…i] into j buckets, with i ranging from 1 to N and j from 1 to k.
– Recurrence: T(i, j+1) = min_{m < i} [ T(m, j) + err(m+1, i) ], where err(m+1, i) is the error of a bucket with s = m+1 and e = i; check all m < i.
– Base case: T(i, 1) is just the variance of the first i values.
This gives an O(N²k)-time algorithm using O(Nk) space, provided that for given indices s, e we can calculate err(s, e) in O(1) time (see the runnable sketch after the next slide).

Slide 32: Dynamic Programming (contd.)
Define S(j) = Σ_{i=1}^{j} v_i (the prefix-sum vector) and SS(j) = Σ_{i=1}^{j} v_i² (the prefix sum-of-squares vector), for j from 1 to N. By the observation on Slide 30, err(s, e) = (SS(e) − SS(s−1)) − (S(e) − S(s−1))² / (e − s + 1), so each err query is O(1) after O(N) preprocessing.
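A runnable sketch combining Slides 31 and 32 (names are ours; it uses the bucket-error identity from Slide 30 via the prefix vectors S and SS):

```python
# Sketch of the V-optimal DP: O(N^2 k) time, O(Nk) space, with err(s, e)
# computed in O(1) from prefix sums and prefix sums of squares.

def v_optimal(v, k):
    n = len(v)
    S = [0.0] * (n + 1)
    SS = [0.0] * (n + 1)
    for i, x in enumerate(v, 1):
        S[i] = S[i - 1] + x
        SS[i] = SS[i - 1] + x * x

    def err(s, e):                # 1-based, inclusive: sum (v_j - mu)^2
        cnt = e - s + 1
        tot = S[e] - S[s - 1]
        return (SS[e] - SS[s - 1]) - tot * tot / cnt

    INF = float("inf")
    T = [[INF] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][1] = err(1, i)       # one bucket = variance of the prefix
        for j in range(2, min(i, k) + 1):
            T[i][j] = min(T[m][j - 1] + err(m + 1, i)
                          for m in range(j - 1, i))
    return T[n][k]

print(v_optimal([1, 1, 1, 9, 9, 9, 5, 5], 3))   # -> 0.0 (three flat runs)
```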

Slide 33: List of Papers.
[Hoeffding63] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, 58:13-30, 1963.
[MP80] J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, 12:315-323, 1980.
[Vit85] J. S. Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
[MRL98] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Approximate Medians and other Quantiles in One Pass and with Limited Memory", ACM SIGMOD 1998, pp. 426-435.
[MRL99] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets", ACM SIGMOD 1999, pp. 251-262.
[GK01] M. Greenwald and S. Khanna, "Space-Efficient Online Computation of Quantile Summaries", ACM SIGMOD 2001, pp. 58-66.

