Sampling for Windows on Data Streams by Vladimir Braverman


1 Sampling for Windows on Data Streams by Vladimir Braverman vova@cs.ucla.edu

2 Data Stream Sequence of elements D = p_1, p_2, …, p_N, where each p_i is drawn from [m]. Objective: Calculate a function f(D). Restrictions: single pass, sub-linear memory, fast processing time (per element). [Figure: stream elements p_1, …, p_N along a time axis]

3 Motivation Today’s applications: Huge amounts of data are whizzing by. Objective: Mining the data, computing statistics, etc. Restrictions: Expensive overhead is not allowed. Useful for many applications: Networking, databases, etc.

4 Data Stream Intensive theoretical research. Streaming systems: Stream (Stanford), StreamMill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.

5 Data Stream The model allows insertions only. What about deletions? Turnstile model. Sliding windows. [Figure: stream elements p_1, …, p_N along a time axis]

6 Sliding Windows (n=5) SW contains the n most recent elements, which are “active”. Older elements are “expired”. [Figure: elements p_1, …, p_8 marked as expired or active]

7 Sliding Windows (n=5) SW contains the n most recent elements, which are “active”. Older elements are “expired”. [Figure: elements p_1, …, p_N along a time axis, marked as expired or active]

8 Sequence-based Windows (n=5 in the figure; n is “huge”) SW contains the n most recent elements, which are “active”. Older elements are “expired”. [Figure: elements p_1, …, p_N along a time axis, marked as expired or active]

9 Timestamp-based windows [Figure: elements p_1, …, p_13 along a time axis]

10 What is known on sliding windows
[BDM 02] Random sampling
[DGIM 02] Sum, count, average, Lp (0<p≤2), weakly additive functions
[DM 02] Rarity, similarity
[GT 02] Distributed sum, count
[FKZ 02], [CS 04] Diameter
[BDMO 03] Variance, k-medians
[GDDLM 03] Frequent elements
[AM 04] Counts, quantiles
[AGHLRS 04] LIS
[LT 06] Frequent items
[LT 06] Count
[ZG 06] Variance
[CCM 07] Entropy

11 Random Sampling

12 Fundamental approximation method: Pick a subset S of D. Use f(S) to approximate f(D). [Figure: stream elements p_1, …, p_N]

13 Types of k-sampling With replacement: samples x_1, …, x_k are independent. Without replacement: repetitions are forbidden, i.e., x_i ≠ x_j for i ≠ j.

14 Properties of Random Sampling General, simple, first-to-try method. Stores an element, not an aggregation. Allows changing f a posteriori. Can be used for multiple statistics. Provides effective solutions with worst-case guarantees. The only known solution for many problems.

15 Some Known Methods for Data Streams
Reservoir Sampling [V 85]
Concise Sampling [GM 98]
Inverse Sampling [CMR 05]
Weighted Sampling [CMN 99]
Biased Sampling [A 06]
Priority Sampling [ADLT 05]
Dynamic Sampling [FIS 05]
Chain Sampling [BDM 02]

16 Streaming Sampling Easy if N is fixed: pick a random index I from {1,2,…,N} and output p_I. But: N is not known in advance. Naïve methods: store the whole stream (linear memory), or “guess” the final value of N (not really uniform).

17 Reservoir Sampling (Vitter 85) Maintains k uniform samples without replacement using Θ(k) space. Outputs a sample for every prefix. Intuition: the probability of picking p_i decreases as N grows, so the probabilities can be adjusted dynamically.

18 Reservoir Sampling (Vitter 85) Reservoir (array) of k elements, initially empty. Algorithm: Insert the first k elements into the reservoir. For i>k, pick p_i with probability k/i. If p_i is chosen, pick one of the samples in the reservoir uniformly at random and replace it with p_i.
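
The steps on this slide can be sketched in Python as follows. This is a minimal sketch, not the author's code: the function name `reservoir_sample` and the `rng` parameter are illustrative assumptions. It accepts element p_i with probability k/i (which equals 1/i for a single sample, k=1), the standard choice that keeps the reservoir uniform over every prefix.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly without replacement from an iterable.

    Hypothetical helper for illustration; uses Theta(k) memory.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Accept p_i with probability k/i and evict a uniformly chosen slot.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

For example, `reservoir_sample(range(1000), 10)` returns ten distinct stream elements without knowing the stream length in advance.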

19 Sampling on Sliding Windows: Problem Definition Maintain uniform random sampling on sliding windows Output a sample for every window Use provably optimal memory

20 Sampling for Sliding Windows Can we use previous methods? No: samples expire. [Figure: elements p_1, …, p_8 along a time axis, n=5]

21 Naïve Approach Store the whole window Linear memory => compute f(W) directly

22 Periodic Sampling Pick a sample p_i from the first window. When p_i expires, take the new element. Continue… [Figure: elements p_1, …, p_8 along a time axis, n=5]
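
A short Python sketch of periodic sampling, assuming 1-based stream positions; the function name `periodic_sample_positions` is hypothetical. The element that arrives when p_i expires is p_{i+n}, so only the first position is random and every later one follows from it.

```python
import random

def periodic_sample_positions(n, num_windows, rng=random):
    """Positions sampled by periodic sampling over num_windows window lengths."""
    first = rng.randrange(1, n + 1)   # uniform position in the first window
    # Each replacement is the element arriving exactly n positions later.
    return [first + j * n for j in range(num_windows)]
```

Because consecutive positions always differ by exactly n, seeing one sample reveals all future ones.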

23 Periodic Sampling: problems Vulnerability to malicious behavior Given one sample, it is possible to predict all future samples Poor representation of periodic data If the period “agrees” with the sample Unacceptable for applications

24 Sampling on Sliding Windows: Problem Definition Maintain uniform random sampling on sliding windows Use provably optimal memory Samples on distinct windows are independent

25 Chain and Priority Methods Babcock, Datar, Motwani, SODA 2002. Maintain uniform random sampling on sliding windows. Chain Sampling: sequence-based windows, with replacement. Uses optimal memory in expectation. Uses O(k log n) w.h.p. Samples on distinct windows are weakly dependent. Priority Sampling: timestamp-based windows, with replacement. Uses optimal memory in expectation and w.h.p. Samples on distinct windows are independent.

26 S3 Algorithms Maintain uniform random sampling on sliding windows. Supports all cases. Provably optimal. Samples on distinct windows are independent.

27 Window Sampling Taxonomy Window type: Sequence-based or Timestamp-based. Sampling type: With Replacement or Without Replacement.

28 Sampling With Replacement on Sequence-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Periodic | O(k) | Yes
Chain (BDM 02) | O(k) in expectation | Weak
S3 (our result) | O(k) | No

29 Sampling Without Replacement on Sequence-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Periodic | O(k) | Yes
S3 | O(k) | No

30 Sampling With Replacement on Time-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Priority (BDM 02) | O(k log n) w.h.p. | No
S3 | O(k log n) | No

31 Sampling Without Replacement on Time-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
S3 | O(k log n) | No

32 Window Sampling S3: Recap
Sampling | Sequence-based | Timestamp-based
With Replacement | O(k) | O(k log n)
Without Replacement | O(k) | O(k log n)

33 Concepts Prior algorithms: replacement policy for expired samples. S3 algorithms: divide the stream into buckets, keep sample(s) for each bucket, apply a combination rule.

34 Sampling With Replacement for Sequence-Based Windows

35 Notations [Figure: stream p_1, …, p_{N+3} along a time axis, divided into buckets B_1, B_2, …, B_{N/n}, B_{N/n+1}. Legend: active element, expired element, future element, bucket]

36 The Algorithm (for one sample) Divide D into buckets of size n. Maintain a random sample for each bucket (reservoir algorithm). Combine samples of buckets that have active elements; there are at most two such buckets. [Figure: buckets B_1, B_2, …, B_{N/n}, B_{N/n+1} with samples R_1, R_2, …, R_{N/n}, R_{N/n+1} along a time axis]

37 [Figure: buckets B_{N/n} and B_{N/n+1} with samples R_1 and R_2, and the output sample X]

38 Case 1 [Figure: buckets B_{N/n} and B_{N/n+1}, and the output sample X]

39 Case 2 [Figure: buckets B_{N/n} and B_{N/n+1} with samples R_1 and R_2, and the output sample X]
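
The bucket idea for one sample can be sketched in Python. The slides show the combination rule only as figures, so the rule used here is an assumption: output the sample of the last complete bucket if that sample is still active, otherwise output the reservoir sample of the current partial bucket. The class `SequenceWindowSampler` and its methods are hypothetical names.

```python
import random

class SequenceWindowSampler:
    """One uniform sample from the last n stream elements (illustrative sketch)."""

    def __init__(self, n, rng=random):
        self.n = n
        self.rng = rng
        self.count = 0      # number of elements seen so far (N)
        self.prev = None    # (index, value) sampled from the last complete bucket
        self.cur = None     # (index, value) reservoir sample of the partial bucket

    def add(self, value):
        self.count += 1
        pos = (self.count - 1) % self.n + 1   # position inside the current bucket
        if pos == 1:
            # A new bucket starts; the finished bucket's sample becomes prev.
            self.prev, self.cur = self.cur, None
        if self.rng.random() < 1.0 / pos:
            # Single-sample reservoir step inside the current bucket.
            self.cur = (self.count, value)

    def sample(self):
        if self.cur is None:
            raise ValueError("no elements seen yet")
        oldest_active = self.count - self.n + 1
        if self.prev is not None and self.prev[0] >= oldest_active:
            return self.prev[1]   # assumed rule: previous bucket's sample is active
        return self.cur[1]        # assumed rule: fall back to the partial bucket
```

Under this rule, if the partial bucket holds j elements, the previous sample is active with probability (n-j)/n and is then uniform over the n-j active elements, while the partial bucket's sample is used with probability j/n and is uniform over j elements, so each window element is returned with probability 1/n.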

40 Sampling Without Replacement for Sequence-Based Windows

41 The Algorithm Divide D into buckets of size n. Maintain k random samples for each bucket. Combine samples of buckets that have active elements. [Figure: k=2; buckets B_1, B_2, …, B_{N/n}, B_{N/n+1} with samples R_{1,1}, R_{1,2}, R_{2,1}, R_{2,2} along a time axis]

42 [Figure: buckets B_{N/n} and B_{N/n+1} with samples R_1 = {R_{1,1}, R_{1,2}} and R_2 = {R_{2,1}, R_{2,2}}; output X = {R_{1,1}, R_{2,2}}]
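
A Python sketch of the k-sample variant. The combination rule survives only as a figure, so the rule below is an assumption rather than a transcription: keep the still-active samples of the last complete bucket and fill the remaining slots with a random subset of the partial bucket's reservoir. The class name `SequenceWindowKSampler` is hypothetical.

```python
import random

class SequenceWindowKSampler:
    """k samples without replacement from the last n elements (illustrative sketch)."""

    def __init__(self, n, k, rng=random):
        assert 1 <= k <= n
        self.n, self.k, self.rng = n, k, rng
        self.count = 0
        self.prev = []   # (index, value) pairs sampled from the last complete bucket
        self.cur = []    # k-reservoir over the current partial bucket

    def add(self, value):
        self.count += 1
        pos = (self.count - 1) % self.n + 1   # position inside the current bucket
        if pos == 1:
            self.prev, self.cur = self.cur, []
        if pos <= self.k:
            self.cur.append((self.count, value))
        elif self.rng.random() < self.k / pos:
            self.cur[self.rng.randrange(self.k)] = (self.count, value)

    def sample(self):
        oldest_active = self.count - self.n + 1
        # Assumed rule, step 1: keep samples of the previous bucket that are active.
        active = [v for (i, v) in self.prev if i >= oldest_active]
        # Assumed rule, step 2: fill the rest from the partial bucket's reservoir.
        need = min(self.k, self.count) - len(active)
        fill = self.rng.sample(self.cur, need)
        return active + [v for (_, v) in fill]
```

The number of active samples in the previous bucket has the same hypergeometric distribution as the number of elements that a uniform k-subset of the window would place in that bucket, which is why this fill rule gives a uniform k-subset.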

43 Sampling With Replacement for Timestamp-Based Windows

44 Timestamp-based window n is unknown! It can change arbitrarily. Does our concept work? How to divide the stream into buckets? How to combine samples?

45 The main idea, revised What if we can maintain buckets A, B as before? Samples from A and B; a=|A|, b=|B|, c=|A ∩ W|. If the sample from A is expired, X = sample from B. If the sample from A is active, X = sample from A with probability a/n; otherwise X = sample from B. [Figure: elements p_{N-16}, …, p_{N+3} split into buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

46 Correctness [Figure: elements p_{N-16}, …, p_{N+3} split into buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]
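
The correctness argument can be reconstructed from the combination rule on slide 45. The following is a sketch under stated assumptions: the sample from A is uniform on A, the sample from B is uniform on B, the two are independent, every element of B is active, and n = b + c.

```latex
\begin{align*}
\text{For an active } p \in A \cap W:\quad
  \Pr[X = p] &= \frac{1}{a}\cdot\frac{a}{n} = \frac{1}{n},\\
\Pr[X \in A] &= \frac{c}{a}\cdot\frac{a}{n} = \frac{c}{n},\\
\text{For } q \in B:\quad
  \Pr[X = q] &= \Bigl(1 - \frac{c}{n}\Bigr)\cdot\frac{1}{b}
             = \frac{b}{n}\cdot\frac{1}{b} = \frac{1}{n}.
\end{align*}
```

With the figure's values (n=13, a=5, b=10, c=3), an active sample from A is accepted with probability 5/13, so X falls in A with probability 3/13 and in B with probability 10/13, and each of the 13 active elements is chosen with probability 1/13.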

47 Conclusions The combination rule works if: 1. a ≤ n. 2. It is possible to generate events w.p. a/n. [Figure: elements p_{N-16}, …, p_{N+3} split into buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

48 The First Problem How to maintain A, B at any moment? |A| is less than n.

49 The solution: ζ-decomposition List of buckets B_1, …, B_s that contain all active elements. 2 samples from each bucket. B_1 may contain expired elements as well. Define A and B from these buckets, ensuring that |A| ≤ |B| and s = O(log n). [Figure: buckets B_1, B_2, B_3, B_4, …, B_{s-1}, B_s]

50 ζ-decomposition : implementation Similar idea to smooth histograms Slightly different structure

51 The Second Problem Assuming a ≤ b ≤ n, how to generate events w.p. a/n? a, b are known; c is unknown and n = b + c. [Figure: elements p_{N-16}, …, p_{N+3} split into buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

52 Approach Generate a “biased” sample Y on A such that Y expires w.p. b/n. Use Y to obtain probability a/n. The details are in the paper.

53 Lemma 1 Given a random sample from A, it is possible to construct a random variable Y on A such that Y expires w.p. b/n. [Figure: elements p_{N-16}, …, p_{N+3} split into buckets A and B; n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3]

54 Generate a random vector V on D = A × {0,1}^a. V = (Q, H_1, …, H_a) consists of independent random variables: Q ~ U(A), and H_i = 1 w.p. ab/((b+i)(b+i+1)). Define a set of subspaces of D: A_i = {p_{N-b-i}} × {0,1}^{i-1} × {1} × {0,1}^{a-i}.

55 Lemma 2 Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n. Proof sketch: Generate an event T that happens w.p. a/b. This is possible since a ≤ b and a, b are known.
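
One way the proof sketch can be completed, assuming (this step is not on the slide) that T is generated independently of Y and that Z is set to 1 exactly when T occurs and Y is expired. It uses the property from slide 52 that Y expires w.p. b/n.

```latex
\Pr[Z = 1] = \Pr[T] \cdot \Pr[Y \text{ is expired}]
           = \frac{a}{b} \cdot \frac{b}{n}
           = \frac{a}{n}.
```

The unknown quantity n = b + c cancels in the product, so the event with probability a/n is obtained without ever computing c.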

56 Sampling Without Replacement for Timestamp-Based Windows

57 Main idea Implement a k-sample without replacement using k independent samples. What can we do if the same point is sampled more than once? Approach: sample from different domains.

58 Cascading lemma H_i^j: a j-sample (without replacement) from {1,…,i}. Given H_i^j and H_{i+1}^1, we can construct H_{i+1}^{j+1}.

59 Cascading Lemma (Illustration) Independent single samples: H_{n-k+1}^1, H_{n-k+2}^1, H_{n-k+3}^1, H_{n-k+4}^1, …, H_{n-1}^1, H_n^1. Cascaded samples: H_{n-k+2}^2, H_{n-k+3}^3, H_{n-k+4}^4, …, H_{n-1}^{k-1}, H_n^k.
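
A Python sketch of the cascade. The slides do not state the construction, so the step below is an assumption: it is the update used in Floyd's sampling algorithm, which adds the new maximum when the single draw collides with the current sample and adds the draw otherwise. The function names are hypothetical.

```python
import random

def cascade_step(sample, x, new_max):
    """Extend a j-subset of {1..new_max-1} using a draw x from {1..new_max}."""
    # On a collision, take the new largest element instead of x.
    return sample | {new_max} if x in sample else sample | {x}

def cascade(singles, n):
    """singles[t] is a draw from {1..n-k+1+t}; returns a k-subset of {1..n}."""
    k = len(singles)
    result = set()
    for t, x in enumerate(singles):
        result = cascade_step(result, x, n - k + 1 + t)
    return frozenset(result)

def k_sample_without_replacement(n, k, rng=random):
    """Build a k-sample from k independent single samples on growing domains."""
    singles = [rng.randint(1, n - k + 1 + t) for t in range(k)]
    return cascade(singles, n)
```

Each step maps a uniform j-subset of {1,…,i} and an independent uniform draw from {1,…,i+1} to a uniform (j+1)-subset of {1,…,i+1}, so repeated points never need to be resampled.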

60 Conclusions Random Sampling Optimally solved Gives worst-case solutions for many problems

61 Thank you!

