Maintaining Variance over Data Stream Windows
Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O'Callaghan, Stanford University
ACM Symp. on Principles of Database Systems (PODS 2003), June 2003
Presented by C.-L. Lin
Data streams
– Traditional DBMS (Database Management System): data stored in finite, persistent data sets; one-time queries; relatively low update rate.
– Data Stream Management System (DSMS): data arrives as continuous, possibly infinite data streams; continuous queries; etc.
– An example of a continuous query: in a telecom company, we are interested in finding all outgoing calls longer than 2 minutes.
– New applications: sensor networks, network monitoring and traffic engineering, telecom call records, network security, financial applications, manufacturing processes, web logs and clickstreams, massive data sets.
Data Stream Management System (DSMS)
[Figure: DSMS architecture. Labels: User/Application, Query, Query Results, Stream Query Processor, Summary data (sum, count, variance, …), Limited Memory and/or Disk, incoming stream of values.]
Sliding Window Model
[Figure: a stream of values along a time axis (time increases), with a window of size N ending at the current time. Labels: Expired data, Future data, Current Time, Window Size = N, Timestamps 7 6 5 4 3 2 1.]
1. When N is large (many hours, days, or months), we cannot buffer the entire sliding window in memory: that requires O(N log R) bits, where R is the upper bound on the absolute value of the data. So we cannot compute the sum, count, or variance exactly at every instant.
2. Instead, compute the variance over the sliding window approximately, using as little memory as possible.
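For contrast with the approximate approach, here is a minimal sketch of the exact baseline that buffers the whole window. The class name `ExactWindowVariance` is hypothetical, and the variance is assumed to be the unnormalized sum of squared deviations, which matches the later bound V ≤ NR².

```python
from collections import deque


class ExactWindowVariance:
    """Exact baseline: keeps all of the last N values, so memory grows with N."""

    def __init__(self, window_size):
        # A bounded deque discards the oldest (expired) value on overflow.
        self.window = deque(maxlen=window_size)

    def add(self, x):
        self.window.append(x)

    def variance(self):
        # Assumption: V is the sum of squared deviations, not divided by N.
        n = len(self.window)
        if n == 0:
            return 0.0
        mean = sum(self.window) / n
        return sum((x - mean) ** 2 for x in self.window)
```

Every call to `variance()` rescans the buffer, and the buffer itself is the O(N log R) bits the slide rules out for large N.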
Review (1) Mean (2) Variance (3) Relative estimation error
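The three quantities can be written as follows. This is a sketch under two assumptions: the variance is the unnormalized sum of squared deviations over the N window elements (consistent with the bound V ≤ NR² used for the bucket count), and \hat{V} denotes the estimate with target relative error ε.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
V = \sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\frac{|\hat{V} - V|}{V} \le \varepsilon
```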
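The Concept of Buckets (2/2)
[Figure: the window of size N on the time axis, partitioned into buckets B_1, B_2, B_3, …, B_{m-1}, B_m, with suffix bucket B_{m*}.]
(1) For each bucket B_i, maintain its timestamp t_i, size n_i, mean μ_i, and variance V_i.
(2) Proof later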
Estimated Variance
[Figure: the window of size N over buckets B_1, B_2, B_3, …, B_{m-1}, B_m and suffix bucket B_{m*}, with the source of error marked.]
Lemma 1
Proof. Define δ_i = μ_i − μ_{i,j} and δ_j = μ_j − μ_{i,j}.
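The δ definitions are the ones used when combining the statistics of two buckets B_i and B_j into B_{i,j}. A minimal sketch, assuming the standard combination rule V_{i,j} = V_i + V_j + (n_i n_j / n_{i,j})(μ_i − μ_j)², which equals V_i + n_i δ_i² + V_j + n_j δ_j². The `Bucket` class and `merge` function are hypothetical names, and the timestamp is assumed to be the absolute arrival time of the bucket's most recent element.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    t: int       # assumed: arrival time of the most recent element
    n: int       # number of elements in the bucket
    mean: float  # mean of the elements
    var: float   # sum of squared deviations from the mean


def merge(bi: Bucket, bj: Bucket) -> Bucket:
    """Combine two buckets' statistics without revisiting the raw elements."""
    n = bi.n + bj.n
    mean = (bi.n * bi.mean + bj.n * bj.mean) / n
    # The cross term accounts for the gap between the two bucket means.
    var = bi.var + bj.var + (bi.n * bj.n / n) * (bi.mean - bj.mean) ** 2
    return Bucket(t=max(bi.t, bj.t), n=n, mean=mean, var=var)
```

Because the cross term is never negative, the merged variance is at least the sum of the two individual variances.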
When a new element x_t arrives:
(1) Create a new bucket for x_t. The new bucket becomes B_1 with V_1 = 0, μ_1 = x_t, n_1 = 1. Each old bucket B_i becomes B_{i+1}.
(2) If t_m > N, the oldest bucket B_m has expired: delete it. Bucket B_{m-1} becomes the new oldest bucket; update B_{m-1*}.
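The two arrival steps can be sketched as below. This is an illustration under assumptions, not the paper's full procedure: `on_arrival` and `Bucket` are hypothetical names, timestamps are stored as absolute arrival times (so a bucket is expired once `now - t >= window_size`), and both merging and suffix-bucket maintenance are left out.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    t: int       # assumed: arrival time of the most recent element
    n: int       # number of elements in the bucket
    mean: float  # mean of the elements
    var: float   # sum of squared deviations from the mean


def on_arrival(buckets, x, now, window_size):
    """buckets[0] plays the role of B_1 (newest), buckets[-1] of B_m (oldest)."""
    # Step 1: the new singleton bucket becomes B_1; each old B_i shifts to B_{i+1}.
    buckets.insert(0, Bucket(t=now, n=1, mean=float(x), var=0.0))
    # Step 2: delete the oldest bucket once it lies entirely outside the window.
    if now - buckets[-1].t >= window_size:
        buckets.pop()
    return buckets
```

A Python list keeps the sketch short; a `collections.deque` would make the front insertion constant-time.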
Bucket Merge
Invariant 1: For every bucket B_i,
– Ensures that the relative error is ≤ ε.
Invariant 2: For each i > 1, for every bucket B_i,
– Ensures that the total number of buckets is small: O((1/ε²) log(NR²)).
Number of Buckets
Lemma 2: The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²) log(NR²)), where R is an upper bound on the absolute value of the data elements.
– From the merge rule: the variance of the union of two buckets is no less than the sum of the individual variances.
– By Invariant 2, the variance of the suffix bucket B_{i*} doubles after every O(1/ε²) buckets.
– Total number of buckets: no more than O((1/ε²) log V), where V is the variance of the last N points. V is no more than NR², which gives O((1/ε²) log(NR²)).
[Figure labels: V_{3*}, V_{3,4}, V_{5*}]
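The bound V ≤ NR² used in the last step follows in one line, assuming V is the unnormalized sum of squared deviations over the N window elements and |x_i| ≤ R:

```latex
V = \sum_{i=1}^{N} (x_i - \mu)^2
  = \sum_{i=1}^{N} x_i^2 - N\mu^2
  \le \sum_{i=1}^{N} x_i^2
  \le N R^2
```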
Space Complexity
(1) By Lemma 2, the number of buckets maintained at any point in time is O((1/ε²) log(NR²)).
(2) Each bucket requires constant space:
==> Overall memory is O((1/ε²) log(NR²)).
But the fields of a bucket take:
(1) Timestamps: O(log N)
(2) Bucket size:
(3) Mean:
(4) Variance: O(log V) = O(log(NR²))
Estimation Error Estimated Variance: Actual Variance: Error: (1)(2) (3)