How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.


1 How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003

2 Problem Definition ► The Universe: U = {0, …, |U|-1} ► Number of records in the data set: ||A|| = N ► The data set can be thought of as an array: A[i] is the number of records with value i ► A_S is the number of records with values in S ► The Φ-quantiles of an ordered sequence of N data items are the values with rank kΦN, for k = 1, …, 1/Φ ► Our goal is computing ε-approximate Φ-quantiles: find a j_k such that (kΦ - ε)N ≤ A[0, j_k) ≤ (kΦ + ε)N

3

4 Transactions ► Insert(i): A[i] ← A[i] + 1 ► Delete(i): A[i] ← A[i] - 1 ► Let ► Assume: the universe size |U| is known

5 The Main Algorithmic Result ► The RSS Algorithm ► Space Complexity ► Update: every transaction in O(space) time ► Estimation: on demand in O(space) time ► One pass over the data

6 Dyadic Intervals ► log(|U|)+1 resolution levels j ► 2|U|-1 dyadic intervals ► Example for |U| = 8 (values 0 to 7): level 3: I(3,0) … I(3,7); level 2: I(2,0) … I(2,3); level 1: I(1,0), I(1,1); level 0: I(0,0)

7 Arbitrary intervals ► Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals; a prefix interval [0, j] needs at most log(|U|) ► For example, with |U| = 8: A[0,6] = I(1,0)+I(2,2)+I(3,6) ► Intervals starting at 0 will not use the same resolution twice
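The decomposition on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not code from the presentation: the function name `dyadic_cover` and the parameter `log_u` (standing for log(|U|)) are made up here, and the (j, k) pairs follow the slides' I(j,k) numbering, where level 0 is the whole universe and level log(|U|) holds single points.

```python
def dyadic_cover(lo, hi, log_u):
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals.

    Returns (j, k) pairs in the slides' notation: I(j, k) sits at resolution
    level j and covers values k * 2**(log_u - j) .. (k + 1) * 2**(log_u - j) - 1.
    """
    pieces = []
    end = hi + 1  # work with the half-open range [lo, end)
    while lo < end:
        # Largest power of two that is aligned at lo ...
        size = 1 << log_u if lo == 0 else lo & -lo
        # ... shrunk until it fits inside what is left of the range.
        while size > end - lo:
            size >>= 1
        j = log_u - (size.bit_length() - 1)  # level whose intervals have this size
        pieces.append((j, lo // size))
        lo += size
    return pieces


print(dyadic_cover(0, 6, 3))  # the slide's example with |U| = 8
```

For the slide's example the call returns the pairs (1,0), (2,2), (3,6). An interval that does not start at 0, such as [1, 6], can need two pieces at the same level.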

8 Computing quantiles ► Assuming we have the number of records in each dyadic interval, we can efficiently compute the count of any arbitrary interval in A ► To compute the Φ-quantile for any k, we need a j_k s.t.: A[0, j_k) < kΦN < A[0, j_k + 1) ► Use binary search to find it ► Keeping all interval counts explicitly is costly (O(|U|) space)
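The binary search can be illustrated with exact prefix counts standing in for the estimated ones. This is a minimal sketch under that assumption; `prefix_count` and `quantile_index` are hypothetical names, and in the RSS algorithm the exact sum would be replaced by an estimate assembled from dyadic intervals.

```python
def prefix_count(A, j):
    """A[0, j): number of records with value below j (exact in this sketch)."""
    return sum(A[:j])


def quantile_index(A, k, phi):
    """Binary search for the largest j with A[0, j) < k * phi * N."""
    n = sum(A)
    target = k * phi * n
    lo, hi = 0, len(A) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_count(A, mid) < target:
            lo = mid       # mid still satisfies the left inequality
        else:
            hi = mid - 1   # mid overshoots the target rank
    return lo


A = [2, 0, 3, 1, 0, 4, 0, 0]       # counts per value, N = 10
print(quantile_index(A, 1, 0.5))   # index of the median value
```

The search relies on prefix counts being non-decreasing in j, which holds for exact counts; with estimated counts each comparison is only correct up to the estimation error.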

9 Random Subset Sums ► Consider the case j = log(|U|) (point intervals) ► Let S be a random subset of U ► Each u ∈ U has p = ½ of being in S ► E(|S|) = ½|U| ► Define: A_S = Σ_{i ∈ S} A[i] ► E(|A_S|) = ½||A|| = ½N

10 Estimating A[i]
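The body of this slide did not survive in the transcript. One estimator that follows from the random subset sums defined on the previous slide is 2·A_S - N for a set S that contains i: conditioned on i ∈ S, E(A_S) = A[i] + ½(N - A[i]), so E(2·A_S - N) = A[i]. The sketch below checks this identity by brute force on a tiny universe; it is an assumption about what the slide showed, and the function names are hypothetical.

```python
from itertools import product


def subset_sum(A, S):
    """A_S: number of records whose value lies in S."""
    return sum(A[u] for u in S)


def mean_estimate(A, i):
    """Average of 2*A_S - N over every subset S of U that contains i.

    Enumerating all such subsets with equal weight matches drawing each
    other element of U into S independently with probability 1/2.
    """
    n = sum(A)
    others = [u for u in range(len(A)) if u != i]
    estimates = []
    for mask in product([0, 1], repeat=len(others)):
        S = [i] + [u for u, bit in zip(others, mask) if bit]
        estimates.append(2 * subset_sum(A, S) - n)
    return sum(estimates) / len(estimates)


A = [3, 0, 5, 2]
print(mean_estimate(A, 2), A[2])
```

A single set gives a high-variance estimate, which is why the later slides average over many sets and take a median.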

11 Improvement ► Instead of keeping random sets of point dyadic intervals only, keep random sets at all resolutions ► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly is O(|U|)) ► Instead of keeping the sets, keep a small representation of them

12 Pseudorandom set generator ► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½) ► Given a seed of size log(|U|)+1 ► Represent a set S of size O(|U|) ► Quickly test whether i ∈ S or not ► Use the Extended Hamming Code

13 Extended Hamming Code ► Given a seed, tells whether i ∈ S ► For example: • |U| = 8 • Seed size: log|U|+1 = 4 • G(seed, i) = seed × (i-th column) mod 2 ► Efficient to compute ► 3-wise independent
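The generator matrix from this slide is missing in the transcript. The sketch below assumes the usual extended Hamming construction, in which column i is a leading 1 followed by the binary digits of i; `in_set` and `pattern_counts` are hypothetical names. The second function tallies, over all possible seeds, how often each membership pattern occurs for a given group of indices, which makes the 3-wise independence easy to check for |U| = 8.

```python
from collections import Counter
from itertools import product


def in_set(seed, i, log_u):
    """G(seed, i): dot product mod 2 of the seed with column i.

    Assumption: column i is a 1 followed by the log_u binary digits of i.
    seed is a sequence of log_u + 1 bits.
    """
    column = [1] + [(i >> b) & 1 for b in range(log_u)]
    return sum(s * c for s, c in zip(seed, column)) % 2 == 1


def pattern_counts(indices, log_u):
    """Tally the membership patterns of `indices` over every possible seed."""
    seeds = product([0, 1], repeat=log_u + 1)
    return Counter(tuple(in_set(s, i, log_u) for i in indices) for s in seeds)


# With |U| = 8 there are 16 seeds; for any three distinct values each of the
# 8 in/out patterns should occur for the same number of seeds.
print(pattern_counts((0, 3, 6), 3))
```

Four indices are not independent in general: for 0, 1, 2, 3 the four columns sum to zero mod 2, so the fourth membership bit is determined by the other three.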

14 The Data Structure ► For each resolution level j, keep num_copies random subsets S of the dyadic intervals in that level (we only keep the representation seed) ► Keep ||A_S|| for each subset S ► Maintain N = ||A|| ► This gives S_1, …, S_num_copies per level

15 Upon Transactions ► Insert(i) / Delete(i): for each resolution level j: • Locate the single I_{j,k} into which i falls (high-order binary bits of i) • Determine all S_ℓ containing I_{j,k} • For each such S_ℓ, increase/decrease ||A_{S_ℓ}|| by 1
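The update rule can be sketched as a small class. This is a hedged illustration rather than the authors' implementation: the class name `RSS`, its fields, and the membership test (a seed of j + 1 bits per subset at level j, combined with a leading 1 and the bits of k) are assumptions made for the example.

```python
import random


class RSS:
    """Toy random-subset-sum summary: one counter per (level, subset)."""

    def __init__(self, log_u, num_copies, rng):
        self.log_u = log_u
        self.n = 0  # N = ||A||
        # seeds[j][l]: j + 1 random bits describing subset l at level j
        self.seeds = [[[rng.randrange(2) for _ in range(j + 1)]
                       for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        # sums[j][l]: ||A_S|| for that subset
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    @staticmethod
    def contains(seed, k):
        """Assumed membership test: seed . (1, bits of k) mod 2."""
        bits = [1] + [(k >> b) & 1 for b in range(len(seed) - 1)]
        return sum(s * c for s, c in zip(seed, bits)) % 2 == 1

    def update(self, i, delta):
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)  # high-order j bits of i pick I(j, k)
            for l, seed in enumerate(self.seeds[j]):
                if self.contains(seed, k):
                    self.sums[j][l] += delta

    def insert(self, i):
        self.update(i, +1)

    def delete(self, i):
        self.update(i, -1)


summary = RSS(log_u=3, num_copies=4, rng=random.Random(0))
for value in [1, 5, 5, 7]:
    summary.insert(value)
summary.delete(5)
print(summary.n, summary.sums)
```

Because every counter is a plain sum, the state depends only on the current multiset of values and not on the order in which inserts and deletes arrived.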

16 Estimating Quantiles: Dyadic Intervals ► Given a dyadic interval I = I_{j,k} ► There are num_copies = G·E sets of resolution j ► Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so form an estimate of A_I from ||A_{S_ℓ}|| ► Group all estimates into G groups of E elements ► For each group g, calculate the average of its estimates, A_{g,j,k}

17 Estimating Quantiles: Arbitrary intervals ► Given an interval I, write it as a disjoint union of at most 2·log(|U|) dyadic intervals I_{j,k} ► For each of the G groups, sum the group's averages A_{g,j,k} over all I_{j,k} comprising I ► Take the median of the G group sums as the final estimate Ã_I of A_I ► It is more convenient to refer to the result as an overestimate: |A_I| ≤ |Ã_I| ≤ |A_I| + εN

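18 [Figure: example with 3 dyadic intervals, E = 4 elements per group, G = 3 groups. Within each group, the estimates for each dyadic interval are averaged and the averages are summed; the median of the group sums is the interval's estimate.]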
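The average, sum, median pipeline can be written out directly. The sketch below is illustrative only: `interval_estimate` is a hypothetical name, and it assumes every one of the G·E slots already holds a raw estimate for each dyadic interval, whereas in the algorithm only the sets that contain the interval contribute.

```python
from statistics import mean, median


def interval_estimate(per_interval, G, E):
    """Median over groups of (sum over dyadic intervals of the group average).

    per_interval[d] holds G*E raw estimates for the d-th dyadic interval,
    laid out group by group.
    """
    group_sums = []
    for g in range(G):
        total = 0.0
        for estimates in per_interval:
            total += mean(estimates[g * E:(g + 1) * E])
        group_sums.append(total)
    return median(group_sums)


# Two dyadic intervals, G = 3 groups of E = 4 made-up estimates each.
raw = [
    [1, 2, 3, 4, 10, 10, 10, 10, 0, 0, 0, 0],
    [2, 2, 2, 2, 5, 5, 5, 5, 1, 1, 1, 1],
]
print(interval_estimate(raw, G=3, E=4))
```

Averaging within a group reduces variance, and taking the median across groups guards against a group that happens to be far off.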

19 Analysis ► Lemma: The algorithm estimates each quantile to within εN with probability p > 1-δ ► Proof: • For a fixed resolution level j, Let • Then:

20

21 Analysis (cont.)

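22 ► We take G copies of Z and take the median ► By the Chernoff inequality, the median is off by more than εN with probability at most δ/log|U| ► The binary search looked for a j_k such that A[0, j_k) < kΦN < A[0, j_k + 1) ► We made log|U| checks in the binary search ► By the union bound, the probability that any of them failed is at most log|U| times the per-check failure probability, i.e., at most δ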

23 RSS Properties ► The algorithm may return a quantile value that was not seen in the input ► Changing the order of insertions and deletions doesn't affect the results ► The RSSs are composable: U can be split into many disjoint ranges, each summarized separately, given some pre-agreed common random subsets

24 Extension: U is unknown ► Predict a range [0, u-1] for U ► Upon insertion of i > u-1, add another instance of RSS with range [u, u²-1], and so on ► Because RSS is composable, we only have to join the results upon query ► Increased cost factor: log² log(|U|)

25 Experiments ► What is the median length of all active AT&T calls? ► When a call • Starts: insert its start timestamp • Ends: delete its start timestamp ► 4 KB used for RSS ► Compared: • RSS • GK • GK2

26 Number of Active Phone Calls Over Time

27 Error in Computation of Median Over Time

28 Average Error for Last 50 Snapshots, For Deciles

29 The End

