Maintaining Variance and k-Medians over Data Stream Windows
Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O'Callaghan
Stanford University

Data Streams and Sliding Windows
- Streaming data model
  - Useful for applications with high data volumes and timeliness requirements
  - Data processed in a single pass
  - Limited memory (sublinear in stream size)
- Sliding window model
  - Variation of the streaming data model
  - Only recent data matters
  - Parameterized by window size N
  - Limited memory (sublinear in window size)

Sliding Window (SW) Model
[Figure: a stream of elements with time increasing to the right; the window covers the N = 7 most recent elements before the current time.]

Variance and k-Medians
- Variance: Σ_i (x_i − μ)^2, where μ = (Σ_i x_i) / N
- k-median clustering:
  - Given: N points (x_1, …, x_N) in a metric space
  - Find k points C = {c_1, c_2, …, c_k} that minimize Σ_i d(x_i, C) (the assignment distance)
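As a concrete reference point, here is a minimal offline sketch of the two objectives (the point set, the candidate centers, and the one-dimensional distance |·| are illustrative assumptions, not part of the paper):

```python
def variance(xs):
    """Sum of squared deviations: sum over i of (x_i - mu)^2."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def assignment_distance(points, centers):
    """k-median cost: each point pays d(x_i, C), its distance to the
    nearest center; d is |.| on the line here for illustration."""
    return sum(min(abs(p - c) for c in centers) for p in points)

# Hypothetical window of N = 7 points and k = 2 candidate centers.
window = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
print(variance(window))
print(assignment_distance(window, [2.0, 7.0]))
```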

Previous Results in the SW Model
- Count of non-zero elements / sum of positive integers [DGIM'02]
  - (1 ± ε) approximation
  - Space: Θ((1/ε) log N) words, i.e., Θ((1/ε) log^2 N) bits
  - Update time: Θ(log N) worst case, Θ(1) amortized
    - Improved to Θ(1) worst case by [GT'02]
  - Exponential Histogram (EH) data structure
- Generalized SW model [CS'03] (previous talk)

Results – Variance
- (1 ± ε) approximation
- Space: O((1/ε^2) log N) words
- Update time: O(1) amortized, O((1/ε^2) log N) worst case

Results – k-Medians
- 2^{O(1/τ)} approximation of assignment distance (0 < τ < 1/2)
- Space: Õ((k/τ^4) N^{2τ})
- Update time: Õ(k) amortized, Õ((k^2/τ^3) N^{2τ}) worst case
- Query time: Õ((k^2/τ^3) N^{2τ})

Remainder of the Talk
- Overview of Exponential Histograms
- Where EH fails and how to fix it
- Algorithm for variance
- Main ideas in the k-medians algorithm
- Open problems

Sliding Window Computation
- Main difficulty: discounting expiring data
  - As each new element arrives, one old element expires
  - The value of the expiring element can't be known exactly
  - How do we update our data structure?
- One solution: use histograms (see the sketch below)
[Figure: the window partitioned into buckets with sums {3, 2, 1, 2}; after the oldest bucket expires, the sums are {2, 1, 2}.]
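A minimal sketch of the bucketing idea (the (timestamp, sum) bucket representation and the "charge half the straddling bucket" rule are illustrative assumptions): fully active buckets contribute their sums exactly, and only the bucket straddling the window boundary must be estimated.

```python
def window_sum_estimate(buckets, current_time, N):
    """`buckets` is an oldest-first list of (newest_timestamp, bucket_sum)."""
    # Drop buckets whose newest element has already left the window.
    active = [(t, s) for (t, s) in buckets if t > current_time - N]
    if not active:
        return 0.0
    # The oldest surviving bucket may straddle the window boundary, so
    # only part of its sum belongs to the window; charge half of it.
    return sum(s for _, s in active[1:]) + active[0][1] / 2.0
```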

Containing the Error
- Error comes from the last (oldest) bucket
- Need to ensure that the contribution of the last bucket is not too big
- Bad example:
[Figure: a window covered by buckets with sums {4, 4, 4} versus one where a single straddling bucket with sum 4 dominates the estimate.]

Exponential Histograms
- Exponential Histogram algorithm:
  - Initially, buckets contain 1 item each
  - Merge adjacent buckets once the sum of the later (more recent) buckets is large enough
[Figure: successive arrivals evolve the bucket sums from {4, 2, 2, 1} to {4, 2, 2, 1, 1} and {4, 2, 2, 1, 1, 1}; merges then give {4, 2, 2, 2, 1} and {4, 4, 2, 1}.]
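For concreteness, a compact sketch of the classic DGIM-style exponential histogram for counting 1's in a window. Assumptions: the textbook merge rule "at most two buckets of each size" and the "half the oldest bucket" query estimate; the slides' variant may differ in constants.

```python
from collections import deque

class ExpHistogram:
    """DGIM-style approximate count of 1's over the last N positions.
    Buckets are (newest_timestamp, size) pairs, newest first, with
    size a power of two."""

    def __init__(self, N):
        self.N = N
        self.buckets = deque()
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest element leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.time, 1))
        # Restore the invariant: at most two buckets of any one size.
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= 2:
                break
            i, j = idx[-2], idx[-1]          # the two oldest of this size
            newer_ts = self.buckets[i][0]    # merged bucket keeps the newer stamp
            del self.buckets[j]
            self.buckets[i] = (newer_ts, 2 * size)
            size *= 2

    def count(self):
        # Full buckets count exactly; the oldest may straddle the
        # window boundary, so only half of it is charged.
        sizes = [s for _, s in self.buckets]
        return (sum(sizes[:-1]) + sizes[-1] // 2) if sizes else 0
```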

Where EH Goes Wrong
- [DGIM'02] can estimate any function f defined over windows that satisfies:
  - Positive: f(X) ≥ 0
  - Polynomially bounded: f(X) ≤ poly(|X|)
  - Composable: f(X + Y) can be computed from f(X), f(Y), and little additional information
  - Weakly additive: f(X) + f(Y) ≤ f(X + Y) ≤ c (f(X) + f(Y))
- The "weakly additive" condition does not hold for variance or k-medians

Notation
- V_i = variance of the i-th bucket
- n_i = number of elements in the i-th bucket
- μ_i = mean of the i-th bucket
[Figure: the current window of size N covered by buckets B_m, B_{m−1}, …, B_2, B_1, with B_m the oldest and B_1 the most recent.]

Variance – Composition
- B_{i,j} = concatenation of buckets i and j; its statistics combine exactly:
  - n_{i,j} = n_i + n_j
  - μ_{i,j} = (n_i μ_i + n_j μ_j) / n_{i,j}
  - V_{i,j} = V_i + V_j + (n_i n_j / n_{i,j}) (μ_i − μ_j)^2
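These composition rules transcribe directly into code; a minimal sketch (the Bucket record is an illustrative representation, not the paper's data structure):

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    n: int      # number of elements
    mu: float   # mean of the bucket's elements
    V: float    # sum of squared deviations from mu within the bucket

def merge(bi: Bucket, bj: Bucket) -> Bucket:
    """Combine the (n, mu, V) statistics of two buckets exactly."""
    n = bi.n + bj.n
    mu = (bi.n * bi.mu + bj.n * bj.mu) / n
    # Internal variances add; the cross term is the "external" contribution.
    V = bi.V + bj.V + (bi.n * bj.n / n) * (bi.mu - bj.mu) ** 2
    return Bucket(n, mu, V)
```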

Failure of "Weak Additivity"
[Figure: points plotted as value vs. time; each bucket's points cluster tightly, but the bucket means differ widely.]
- Variance of each bucket is small
- Variance of the combined bucket is large
- Cannot afford to neglect the contribution of the last bucket

Main Solution Idea
- More careful estimation of the last bucket's contribution
- Decompose variance into two parts:
  - "Internal" variance: within a bucket
  - "External" variance: between buckets
[Figure: two point clouds labeled "internal variance of bucket i" and "internal variance of bucket j", with the gap between their means labeled "external variance".]
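In symbols, this is the standard within/between decomposition (a reconstruction consistent with the pairwise merge rule above, not copied from the slides):

```latex
% Over buckets B_1, ..., B_m with statistics (n_i, \mu_i, V_i) and
% overall mean \mu = \tfrac{1}{N} \sum_i n_i \mu_i :
V \;=\; \underbrace{\sum_{i=1}^{m} V_i}_{\text{internal}}
   \;+\; \underbrace{\sum_{i=1}^{m} n_i \,(\mu_i - \mu)^2}_{\text{external}}
```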

Main Solution Idea (continued)
- When estimating the contribution of the last bucket:
  - Internal variance: charged evenly to each point
  - External variance: pretend each point is at the average for its bucket
- Variance of the bucket is small ⇒ points aren't too far from the average
- Points aren't far from the average ⇒ the average is a good approximation for each point
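A sketch of the resulting query-time estimate, reusing Bucket and merge from above. Assumption for illustration: the surviving points of the last bucket are pretended to sit at its mean and its internal variance is charged pro rata; the paper's actual estimator differs in its details.

```python
def estimate_window_variance(last, suffix, N):
    """`last` holds the statistics of the oldest bucket B_m, which may
    straddle the window boundary; `suffix` holds the exact combined
    statistics of all newer buckets; N is the window size."""
    n_active = N - suffix.n                   # survivors inside B_m
    if n_active <= 0:
        return suffix.V                       # B_m has fully expired
    internal = last.V * n_active / last.n     # pro-rata internal charge
    surviving = Bucket(n_active, last.mu, internal)
    return merge(surviving, suffix).V
```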

Main Idea – Illustration
[Figure: value vs. time, with the vertical spread of the last bucket's points marked.]
- Spread is small ⇒ external variance is small
- Spread is large ⇒ error from "bucket averaging" is insignificant

Variance – Error Bound
- Theorem: relative error ≤ ε, provided V_m ≤ (ε^2/9) V_{m*}
  (B_{m*} denotes the combination of the later buckets B_{m−1}, …, B_1)
- Aim: maintain V_m ≤ (ε^2/9) V_{m*} using as few buckets as possible
[Figure: the window of size N covered by B_m, B_{m−1}, …, B_2, B_1; the suffix B_{m−1}, …, B_1 is marked B_{m*}.]

Variance – Algorithm
- EH algorithm for variance:
  - Initially, buckets contain 1 item each
  - Merge adjacent buckets B_i, B_{i−1} whenever the following condition holds:
    (9/ε^2) V_{i,i−1} ≤ V_{(i−1)*}
    (i.e., the variance of the merged bucket is small compared to the combined variance of the later buckets)
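A direct, naive sketch of this merge rule, reusing Bucket and merge from above. It rescans all buckets on every check (O(m^2) per pass), whereas the paper maintains the needed suffix statistics incrementally.

```python
from functools import reduce

def combined(buckets):
    """Exact (n, mu, V) statistics of several buckets via pairwise merge()."""
    return reduce(merge, buckets)

def compress(buckets, eps):
    """Enforce the merge rule. `buckets` is oldest-first:
    buckets[0] = B_m, buckets[-1] = B_1. Adjacent buckets are merged
    while the merged pair's variance is at most (eps**2 / 9) times the
    combined variance of all newer buckets."""
    i = 0
    while i + 1 < len(buckets):
        newer = buckets[i + 2:]              # buckets newer than the pair
        if newer:
            pair = merge(buckets[i], buckets[i + 1])
            if (9 / eps ** 2) * pair.V <= combined(newer).V:
                buckets[i:i + 2] = [pair]    # merge B_i with B_{i-1}
                continue                     # re-test at the same index
        i += 1
    return buckets
```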

Invariants
- Invariant 1: (9/ε^2) V_i ≤ V_{i*}
  - Ensures that the relative error is ≤ ε
- Invariant 2: (9/ε^2) V_{i,i−1} > V_{(i−1)*}
  - Ensures that the number of buckets is O((1/ε^2) log N)
  - Each bucket requires O(1) space

Update and Query Time
- Query time: O(1)
  - We maintain the n, V, and μ values for B_m and B_{m*}
- Update time: O((1/ε^2) log N) worst case
  - Time to check and combine buckets
  - Can be made O(1) amortized: merge buckets periodically instead of after each new data element

k-Medians Summary (1/2)
- Assignment distance substitutes for variance
- Assignment distance is obtained from an approximate clustering of the points in each bucket
- Use the hierarchical clustering algorithm of [GMMO'00]:
  - Original points are clustered to give level-1 medians
  - Level-i medians are clustered to give level-(i+1) medians
  - Medians are weighted by the count of assigned points
- Each bucket maintains a collection of medians at various levels

k-Medians Summary (2/2)
- Merging buckets (see the sketch below):
  - Combine the medians from each level i
  - If they exceed N^τ in number, cluster them to get level-(i+1) medians
- Estimation procedure: weighted clustering of all medians from all buckets to produce k overall medians
- Estimating the contribution of the last bucket:
  - Pretend each point is at its closest median
  - Relies on approximate counts of active points assigned to each median
- See the paper for details!
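A structural sketch of the bucket-merge step. The function cluster(weighted_medians, k) stands in for the [GMMO'00] weighted clustering step and is assumed, not specified, here; max_per_level plays the role of the N^τ threshold.

```python
def merge_median_buckets(bucket_a, bucket_b, k, max_per_level, cluster):
    """Each bucket maps level -> list of (median, weight) pairs."""
    merged = {}
    for b in (bucket_a, bucket_b):
        for lvl, meds in b.items():
            merged.setdefault(lvl, []).extend(meds)
    if not merged:
        return merged
    lvl = min(merged)
    while lvl <= max(merged):
        if len(merged.get(lvl, [])) > max_per_level:
            # Too many medians at this level: cluster them into k
            # weighted medians one level up, as on the slide.
            merged.setdefault(lvl + 1, []).extend(cluster(merged[lvl], k))
            merged[lvl] = []
        lvl += 1
    return {l: m for l, m in merged.items() if m}
```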

Open Problems
- Variance:
  - Close the gap between the upper and lower bounds ((1/ε) log N vs. (1/ε^2) log N)
  - Improve the update time from O(1) amortized to O(1) worst case
- k-median clustering:
  - [COP'03] give a polylog(N)-space approximation algorithm in the streaming data model
  - Can a similar result be obtained in the sliding window model?

Conclusion
- Algorithms to approximately maintain variance and k-median clustering in the sliding window model
- Previous results using Exponential Histograms required "weak additivity"
  - Not satisfied by variance or k-median clustering
- Adapted EHs for variance and k-medians
- These techniques may be useful for other statistics that violate "weak additivity"