Approximation and Load Shedding: Sampling Methods
Carlo Zaniolo, CSD—UCLA


Sampling
- Fundamental approximation method: to compute a function F on a set of objects W:
  - pick a subset S of W (often |S| « |W|);
  - use F(S) to approximate F(W).
- Basic synopsis: can save computation, memory, or both.
Two basic schemes:
1. Sampling with replacement: samples x_1, …, x_k are independent (the same object can be picked more than once).
2. Sampling without replacement: repetitions are forbidden.
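The two schemes can be illustrated in Python (this snippet is illustrative, not from the slides):

```python
import random

rng = random.Random(0)
population = list(range(100))

# 1. With replacement: draws are independent; repeats are possible
with_repl = [rng.choice(population) for _ in range(10)]

# 2. Without replacement: all 10 picks are distinct
without_repl = rng.sample(population, 10)
```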

Simple Random Sample (SRS)
- SRS: a sample of k elements chosen at random from a set of n elements.
- Every possible sample of size k is equally likely, i.e., has probability 1/C(n, k), where C(n, k) = n!/(k!(n-k)!).
- Consequently, every element is equally likely to be in the sample.
- SRS can only be implemented if we know n (e.g., by including each element with probability k/n using a random number generator), and even then the resulting size might not be exactly k.

Bernoulli Sampling
- Include each element in the sample with probability q (e.g., if q = 1/2, flip a coin).
- The sample size is not fixed: it is binomially distributed. The probability that the sample contains exactly k elements is C(n, k) q^k (1-q)^(n-k).
- The expected sample size is nq.

Binomial Distribution: Example
[Figure not preserved in this transcript.]

Bernoulli Sampling: Implementation
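The implementation on this slide is not reproduced in the transcript; the naive coin-flip version can be sketched in Python as follows (function name is illustrative):

```python
import random

def bernoulli_sample(stream, q, rng=None):
    """Keep each element independently with probability q."""
    rng = rng or random.Random()
    return [x for x in stream if rng.random() < q]

# The expected sample size is n*q, but the actual size varies (binomial).
sample = bernoulli_sample(range(1000), 0.5, random.Random(42))
```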

Bernoulli Sampling: better implementation
- Proceed by skipping elements: after each insertion, draw how many arrivals to skip.
- The probability of skipping exactly:
  - zero elements is q
  - one element is (1-q)q
  - two elements is (1-q)^2 q
  - …
  - i elements is (1-q)^i q
- The skip length therefore has a geometric distribution.

Geometric Skip
- The skip length can be drawn directly from the geometric distribution, so each insertion costs O(1) random draws instead of one coin flip per element.
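The implementation shown on the original slide is not preserved; a minimal Python sketch using the inverse-CDF draw (assuming 0 < q < 1; function names are illustrative) is:

```python
import math
import random

def geometric_skip(q, rng):
    """Draw the number of elements to skip: P(skip = i) = (1-q)^i * q.
    Inverse-CDF method; assumes 0 < q < 1."""
    u = rng.random()
    return int(math.log(1.0 - u) / math.log(1.0 - q))

def bernoulli_sample_skipping(stream, q, rng=None):
    """Bernoulli sampling that advances by geometric skips
    instead of flipping a coin per element."""
    rng = rng or random.Random()
    sample, skip = [], geometric_skip(q, rng)
    for x in stream:
        if skip == 0:
            sample.append(x)
            skip = geometric_skip(q, rng)
        else:
            skip -= 1
    return sample
```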

Reservoir Sampling (Vitter 1985)
Bernoulli sampling (i) cannot be used unless n is known, and (ii) even if n is known, probability k/n only guarantees a sample of approximately size k.
Reservoir sampling produces a SRS of specified size k from a set of unknown size n (k <= n).
Algorithm:
1. Initialize a "reservoir" with the first k elements.
2. For every following element j > k, insert it with probability k/j (ignore it with probability 1 - k/j).
3. An inserted element replaces a current element of the reservoir, selected with probability 1/k.
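The three steps above can be sketched in Python (illustrative; 1-based positions j as in the slide):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """SRS of size k from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for j, x in enumerate(stream, start=1):
        if j <= k:
            reservoir.append(x)          # step 1: fill with the first k elements
        else:
            r = rng.randrange(j)         # uniform in 0..j-1
            if r < k:                    # step 2: insert with probability k/j
                reservoir[r] = x         # step 3: victim uniform over the k slots
    return reservoir
```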

Reservoir Sampling (cont.)
- The insertion probability (p_j = k/j, j > k) decreases as j increases.
- Also, the opportunities for an element already in the sample to be removed decrease as j increases.
- These trends offset each other.
- The probability of being in the final sample is provably the same for all elements of the input.

Windows: count-based or time-based
- Reservoir sampling can extract k random elements from a set W of arbitrary size.
- If W grows by adding elements, there is no problem.
- But windows on streams also lose elements!
  - Naive solution: recompute the k-reservoir from scratch.
  - Oversampling: keep a larger sample, which needs size O(k log n).
  - Better solutions: next slides.

CBW: Periodic Sampling
- Pick a sample p_i from the first window.
- When p_i expires, take the new element as the sample.
- Continue in this fashion.
[Figure: elements p_1 … p_8 arriving along the time axis.]

Periodic Sampling: problems
- Vulnerable to malicious behavior: given one sample, it is possible to predict all future samples.
- Poor representation of periodic data: if the period "agrees" with the sampling interval.
- Unacceptable for most applications.

Chain Method for Count-Based Windows [Babcock et al. SODA 2002]
- Include each new element in the sample with probability 1/min(i, n).
- As each element is added to the sample, choose the index of the element that will replace it when it expires.
- When the i-th element expires, the window will be (i+1, …, i+n), so choose the index from this range.
- Once the element with that index arrives, store it and choose the index that will replace it in turn, building a "chain" of potential replacements.
- When an element is chosen to be discarded from the sample, discard its "chain" as well.
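A sketch of the chain method for a single sample (k = 1; Python, illustrative names) follows; a full implementation would run k independent copies while handling collisions:

```python
import random

def chain_sample(stream, n, rng=None):
    """Chain-sample sketch for a single sample (k = 1) over a
    count-based window of the last n elements."""
    rng = rng or random.Random()
    chain = []       # [(index, value), ...]; chain[0] is the current sample
    awaiting = None  # index of the chain tail's chosen replacement
    out = []         # sample reported after each arrival
    for i, x in enumerate(stream, start=1):
        # store a previously chosen replacement when its index arrives,
        # and pick the index that will replace *it* in turn
        if awaiting == i:
            chain.append((i, x))
            awaiting = rng.randint(i + 1, i + n)
        # the head expires once it falls out of the window of the last n
        if chain and chain[0][0] <= i - n:
            chain.pop(0)  # its stored successor becomes the sample
        # fresh inclusion with probability 1/min(i, n) discards the old chain
        if rng.random() < 1.0 / min(i, n):
            chain = [(i, x)]
            awaiting = rng.randint(i + 1, i + n)
        out.append(chain[0][1])
    return out
```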

Memory Usage of Chain-Sample
- Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i + x.
- The expected length of each chain is less than T(n) <= e ≈ 2.718.
- If the window contains k samples, this is repeated k times (while avoiding collisions).
- Expected memory usage is O(k).

Timestamp-Based Windows (TBW)
- The window at time t consists of all elements whose arrival timestamp is at least t' = t - m.
- The number of elements in the window is not known in advance and may vary over time.
- The chain algorithm does not work, since it requires windows with a constant, known number of elements.

Sampling TBWs [Babcock et al. SODA 2002]
- Imagine that all n elements in the window are assigned a random priority between 0 and 1.
- The living element with maximum (or minimum) priority is a valid sample of the window.
- As in the case of the max UDA, we can discard every window element that is dominated by a pair with a later timestamp and a higher priority.
- For k samples, simply find the top-k tuples.
- The expected memory usage is therefore O(log n), or O(k log n) for samples of size k; O(k log n) is also an upper bound with high probability (whp).
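A sketch of the priority-based scheme for a single sample (Python, illustrative names):

```python
from collections import deque
import random

def make_priority_sampler(m, rng=None):
    """Priority sampling for a timestamp-based window of length m.
    Keep only elements not dominated by a later arrival with a higher
    priority; stored priorities then decrease over time, and the oldest
    unexpired entry is the sample."""
    rng = rng or random.Random()
    buf = deque()  # (timestamp, priority, value), priorities decreasing

    def insert(t, value):
        p = rng.random()
        # discard entries dominated by this later, higher-priority arrival
        while buf and buf[-1][1] < p:
            buf.pop()
        buf.append((t, p, value))

    def query(t):
        # drop entries that fell out of the window (timestamp <= t - m)
        while buf and buf[0][0] <= t - m:
            buf.popleft()
        return buf[0][2] if buf else None

    return insert, query
```

Dominated elements are discarded eagerly on insert, so the buffer holds the right-to-left maxima of the priorities, whose expected count is O(log n).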

Comparison of Algorithms for CBW

Algorithm     | Expected memory | High-probability bound
--------------|-----------------|-----------------------
Periodic      | O(k)            | O(k)
Oversample    | O(k log n)      | O(k log n)
Chain-Sample  | O(k)            | O(k log n)

An Optimal Algorithm for CBW: O(k) memory [Braverman et al. PODS 09]
For k samples over a count-based window of size W:
- The stream is logically divided into tumbling windows of size W, called buckets in the paper.
- For each bucket, maintain k random samples with the reservoir algorithm.
- As the window of size W slides over the buckets, draw samples from the old bucket and from the new one.
[Figure: the stream p_1, p_2, … partitioned into buckets B_1, B_2, …, with the sliding window straddling the last two buckets.]

[Figure: the active sliding window straddling two buckets of size 5, with expired elements at the start of the old bucket and future elements at the end of the new one.]
The active window slides over two buckets: the old one, whose samples are already known, and the new one, which still contains some future elements.

Bucket of size 5: sample of size 1
- Old bucket: s expired elements, N - s active elements (with reservoir sample R_1).
- New bucket: s active elements, N - s future elements.
- Reservoir sampling is used to compute R_2, the sample of the new bucket's active elements.

Single sample: how to select one sample out of a window of N elements.
- Step 1: select a random position X between 1 and N (a position in the old bucket).
- Step 2: if the element at position X has not yet expired, take it.
- Old bucket: s expired, N - s active. New bucket: s active, N - s future.

- Step 2 (continued): if X corresponds to an element that has expired, take instead a single reservoir sample R_2 from the active segment of the new bucket (s such elements).
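The two-step selection above can be sketched as follows (Python, for k = 1; this is a sketch of the bucket scheme under the stated assumptions, not the paper's exact pseudocode):

```python
import random

def bucket_window_sample(stream, w, rng=None):
    """One uniform sample per arrival from a sliding window of the last w
    elements, via tumbling buckets of size w (sketch for k = 1)."""
    rng = rng or random.Random()
    old_bucket, new_bucket = [], []
    x = rng.randrange(w)     # Step 1: random position into the old bucket
    r2 = None                # reservoir sample of the new bucket's arrivals
    out = []
    for v in stream:
        new_bucket.append(v)
        if rng.randrange(len(new_bucket)) == 0:
            r2 = v           # size-1 reservoir over the new bucket
        s = len(new_bucket)  # = number of expired old-bucket positions
        if old_bucket and x >= s:
            out.append(old_bucket[x])   # Step 2: position x still active
        else:
            out.append(r2)   # x expired (or warm-up): use the reservoir
        if s == w:           # bucket full: it becomes the old bucket
            old_bucket, new_bucket = new_bucket, []
            x = rng.randrange(w)
            r2 = None
    return out
```

Each output is uniform over the current window: an active old-bucket position is hit with probability 1/w, and the expired case (probability s/w) falls back on a reservoir that is uniform over the s new active elements, again giving 1/w per element.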

Results: optimal solutions for all cases of uniform random sampling from sliding windows

Window / Sampling method | Sequence-based | Timestamp-based
-------------------------|----------------|----------------
With replacement         | O(k)           | O(k log n)
Without replacement      | O(k)           | O(k log n)