Approximation and Load Shedding Sampling Methods


Approximation and Load Shedding: Sampling Methods
Carlo Zaniolo, CSD—UCLA

Sampling
The fundamental approximation method: to compute a function F on a set of objects W,
- pick a subset S of W (often |S| « |W|), and
- use F(S) to approximate F(W).
This basic synopsis can save computation, memory, or both.
Sampling with replacement: the samples x1, …, xk are independent (the same object can be picked more than once).
Sampling without replacement: repeated selection of the same tuple is forbidden.

Simple Random Sample (SRS)
An SRS is a sample of k elements chosen at random from a set of n elements such that:
- every possible sample of size k is equally likely, i.e., has probability 1/C(n,k), and
- every element is equally likely to be in the sample.
An SRS can only be implemented if we know n (e.g., so that positions can be drawn by a random number generator), and even then the resulting sample size might not be exactly k.
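When the whole set fits in memory and n is known up front, an SRS without replacement can be drawn directly; a minimal Python sketch using the standard library (function name is mine):

```python
import random

def simple_random_sample(population, k):
    """Draw an SRS of size k without replacement.

    Requires the full population (and hence n) to be known in advance,
    which is exactly the limitation noted above for streams.
    """
    return random.sample(list(population), k)

sample = simple_random_sample(range(1000), 10)
```

Every size-k subset is returned with probability 1/C(n, k); the streaming algorithms of the following slides remove the need to know n in advance.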

Bernoulli Sampling
Include each element in the sample independently with probability q (e.g., if q = 1/2, flip a coin for each element).
The sample size is not fixed; it is binomially distributed: the probability that the sample contains exactly k elements is C(n,k) q^k (1-q)^(n-k).
The expected sample size is nq.

Binomial Distribution: Example

Bernoulli Sampling: Implementation
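The straightforward implementation flips one coin per element; a minimal Python sketch (function name is mine):

```python
import random

def bernoulli_sample(stream, q):
    """Include each element independently with probability q.

    The sample size is random: Binomial(n, q), with mean n * q.
    """
    return [x for x in stream if random.random() < q]

random.seed(42)  # seeded only to make the demo reproducible
s = bernoulli_sample(range(10000), 0.5)
```

The cost is one random number per element; the skip-based variant on the next slide avoids this.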

Bernoulli Sampling: A Better Implementation
By skipping elements: after an insertion,
- the probability of skipping exactly zero elements (i.e., selecting the next one) is q,
- skipping one element is (1-q)q,
- skipping two elements is (1-q)^2 q,
- …
- skipping i elements is (1-q)^i q.
The skip length therefore has a geometric distribution.

Geometric Skip
The skip length can therefore be generated directly, by inverse-transform sampling of the geometric distribution, instead of flipping a coin for each element.
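A sketch of the skip-based implementation (inverse-transform sampling of the geometric distribution; function name is mine):

```python
import math
import random

def bernoulli_sample_skip(items, q):
    """Bernoulli sampling with geometric skips: instead of one coin flip
    per element, draw the skip length directly from the geometric
    distribution P(skip = i) = (1 - q)^i * q."""
    items = list(items)
    sample, i = [], -1
    while True:
        if q >= 1.0:
            skip = 0
        else:
            u = 1.0 - random.random()                    # u in (0, 1]
            skip = int(math.log(u) / math.log(1.0 - q))  # inverse transform
        i += skip + 1                                    # jump to the next selected index
        if i >= len(items):
            return sample
        sample.append(items[i])

random.seed(7)
s = bernoulli_sample_skip(range(10000), 0.25)
```

One logarithm replaces an expected 1/q coin flips per selected element, which matters when q is small.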

Reservoir Sampling (Vitter 1985)
Bernoulli sampling (i) cannot be used unless n is known, and (ii) even if n is known, an inclusion probability of k/n only guarantees a sample of approximately size k.
Reservoir sampling produces a random sample of a specified size k from a set of unknown size n (k ≤ n).
Algorithm:
- initialize a "reservoir" with the first k elements;
- for every following element j > k, insert it with probability k/j (ignore it with probability 1 - k/j);
- an inserted element replaces a current element of the reservoir, selected with probability 1/k.
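The steps above correspond to Vitter's classic Algorithm R; a minimal Python sketch:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of size k from a stream of unknown length,
    in one pass and O(k) memory (Algorithm R)."""
    reservoir = []
    for j, x in enumerate(stream, start=1):
        if j <= k:
            reservoir.append(x)                 # first k elements fill the reservoir
        elif random.random() < k / j:           # insert element j with probability k/j
            reservoir[random.randrange(k)] = x  # evict a uniformly chosen resident
    return reservoir

r = reservoir_sample(iter(range(100000)), 5)
```

Note that the stream is consumed once and its length is never used, so the same code works when n is unknown.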

Reservoir Sampling (cont.)
The insertion probability pj = k/j (j > k) decreases as j increases, but the opportunities for an element already in the sample to be evicted also decrease as j increases. These trends offset each other: the probability of being in the final sample is provably the same for all elements of the input.

Windows: Count-Based or Time-Based
Reservoir sampling can extract k random elements from a set of arbitrary size W. If W only grows by adding elements, there is no problem; but windows on streams also lose elements!
- Naïve solution: recompute the k-reservoir from scratch.
- Oversampling: keep a larger sample, which needs size O(k log n).
- Better solutions: next slides.

CBW: Periodic Sampling
- Pick a sample pi from the first window.
- When pi expires, take the new element as the sample.
- Continue…

Periodic Sampling: Problems
- Vulnerability to malicious behavior: given one sample, it is possible to predict all future samples.
- Poor representation of periodic data: bad if the period "agrees" with the sampling interval.
Unacceptable for most applications.

Chain Method for Count-Based Windows [Babcock et al. SODA 2002]
- Include each new element in the sample with probability 1/min(i, n).
- As each element is added to the sample, choose the index of the element that will replace it when it expires: when the ith element expires the window will be (i+1, …, i+n), so choose the index from this range.
- Once the element with that index arrives, store it and choose the index that will replace it in turn, building a "chain" of potential replacements.
- When an element is chosen to be discarded from the sample, discard its "chain" as well.
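A single-sample sketch of the chain idea (the paper's algorithm maintains k such samples and handles collisions; names and structure here are mine):

```python
import random
from collections import deque

def chain_sample(stream, n):
    """Single uniform sample over a count-based window of the last n
    elements, yielding the current sample after every arrival."""
    sample_idx = sample_val = None
    chain = deque()     # (index, value) replacements for the current sample
    next_idx = None     # index of the next replacement we are waiting for
    for i, x in enumerate(stream, start=1):
        if next_idx == i:                            # a chosen replacement arrives: store it
            chain.append((i, x))
            next_idx = random.randint(i + 1, i + n)  # and pick its own replacement in turn
        if sample_idx is not None and sample_idx <= i - n:
            sample_idx, sample_val = chain.popleft() # sample expired: promote the chain head
        if random.random() < 1.0 / min(i, n):        # new element becomes the sample
            sample_idx, sample_val = i, x
            chain.clear()                            # the old sample's chain is discarded
            next_idx = random.randint(i + 1, i + n)
        yield sample_val

random.seed(3)
samples = list(chain_sample(range(1, 101), 10))
```

Since a replacement index is always drawn from (i+1, …, i+n), the replacement arrives no later than the step at which its predecessor expires, so the chain head is always available on expiry.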

Memory Usage of Chain-Sample
Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i + x. The expected length of each chain is less than T(n) ≤ e ≈ 2.718.
If the window contains k samples, this is repeated k times (while avoiding collisions), so the expected memory usage is O(k).

Timestamp-Based Windows (TBW)
The window at time t consists of all elements whose arrival timestamp is at least t' = t - m.
The number of elements in the window is not known in advance and may vary over time, so the chain algorithm does not work: it requires windows with a constant, known number of elements.

Sampling TBWs [Babcock et al. SODA 2002]
- Imagine that all n elements in the window are assigned a random priority between 0 and 1.
- The live element with the maximum (or minimum) priority is a valid sample of the window.
- As in the case of the max UDA, we can discard every window element that is dominated by a later element with a higher priority.
- For k samples, simply keep the top-k tuples.
The expected memory usage is therefore O(log n) for a single sample and O(k log n) for a sample of size k; O(k log n) is also an upper bound with high probability.
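A single-sample sketch of the priority idea (list-based, names mine; a streaming version would also expire skyline entries incrementally as time advances):

```python
import random

def tbw_priority_sample(arrivals, m, now):
    """Priority sampling over the timestamp-based window [now - m, now].

    Each arrival gets a uniform random priority; an element dominated by
    a later arrival with a higher priority can never again be the live
    maximum, so only a 'skyline' (expected size O(log n)) is retained."""
    skyline = []  # (timestamp, priority, value): timestamps increasing, priorities decreasing
    for ts, value in arrivals:          # arrivals assumed ordered by timestamp
        prio = random.random()
        while skyline and skyline[-1][1] <= prio:
            skyline.pop()               # dominated by the new, later element
        skyline.append((ts, prio, value))
    live = [e for e in skyline if e[0] >= now - m]
    # The live element with the maximum priority is the window sample.
    return max(live, key=lambda e: e[1])[2] if live else None

random.seed(5)
arrivals = [(t, f"e{t}") for t in range(1, 101)]
s = tbw_priority_sample(arrivals, 20, 100)
```

Because the priorities are i.i.d. uniform, the live element with the maximum priority is a uniform choice among the live elements.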

Comparison of Algorithms for CBW

Algorithm       Expected      High-Probability
Periodic        O(k)          O(k)
Oversample      O(k log n)    O(k log n)
Chain-Sample    O(k)          O(k log n)

An Optimal Algorithm for CBW with O(k) Memory [Braverman et al. PODS 09]
For k samples over a count-based window of size W:
- the stream is logically divided into tumbles of size W, called buckets in our paper;
- for each bucket, maintain k random samples by the reservoir algorithm;
- as the window of size W slides over the buckets, draw samples from the old bucket and the new one (e.g., for a single sample, as in the next slides).

The active window slides over two buckets: the old one, where the samples are already known, and the new one, which still contains some future elements. (The slide shows a bucket of size 5, with expired, active, and future elements marked along the time axis.)

Bucket of Size 5: Sample of Size 1
- Old bucket: s elements expired, N - s still active.
- New bucket: s elements active, N - s still in the future.
- Reservoir sampling is used to compute R2, the sample of the new bucket; R1 is the sample of the old bucket.

How to Select One Sample out of a Window of N Elements
Step 1: select a random X between 1 and N.
Step 2: if X has not yet expired, take it.
(Old bucket: s expired, N - s active; new bucket: s active, N - s future.)

Step 2 (cont.): if X corresponds to an element p that has expired, take instead a single reservoir sample from the active segment of the new window (there are s such elements).
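The two-step rule of these slides can be sketched list-based (the streaming version maintains R1 and R2 with reservoir sampling; names mine):

```python
import random

def window_sample(old_bucket, new_active, N):
    """One uniform sample from a count-based window of size N spanning
    two buckets: the last N - s elements of the old bucket plus the
    s = len(new_active) elements seen so far of the new bucket."""
    s = len(new_active)
    x = random.randrange(N)        # Step 1: uniform position in the old bucket
    if x >= s:                     # Step 2: position x has not expired: take it
        return old_bucket[x]
    # Otherwise: a uniform sample from the new bucket's s active elements
    return random.choice(new_active)

random.seed(11)
old = list(range(10))              # the old bucket (N = 10)
new = [10, 11, 12, 13]             # four elements of the new bucket have arrived
draws = {window_sample(old, new, 10) for _ in range(3000)}
```

Each window element is returned with probability 1/N: an unexpired old position x ≥ s directly, while the expired mass s/N is redistributed uniformly over the s new elements.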

Results: optimal solutions for all cases of uniform random sampling from sliding windows

Sampling method        Sequence-based window    Timestamp-based window
With replacement       O(k)                     O(k log n)
Without replacement    O(k)                     O(k log n)