
1 Data Stream Algorithms: Intro, Sampling, Entropy
Graham Cormode

2 Outline

- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy

3 Data is Massive

- Data is growing faster than our ability to store or index it
- There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS and IMs
- Scientific data: NASA's observation satellites each generate billions of readings per day
- IP network traffic: up to 1 billion packets per hour per router; each ISP has many (hundreds of) routers!
- Whole-genome sequences for many species are now available: each megabytes to gigabytes in size

4 Massive Data Analysis

Must analyze this massive data for:
- Scientific research (monitor environment, species)
- System management (spot faults, drops, failures)
- Customer research (association rules, new offers)
- Revenue protection (phone fraud, service abuse)
Else, why even measure this data?

5 Example: Network Data

- Networks are sources of massive data: the metadata per hour per router is gigabytes
- Fundamental problem of data stream analysis: too much information to store or transmit
- So process data as it arrives: one pass, small space. This is the data stream approach.
- Approximate answers to many questions are OK, if there are guarantees of result quality

6 IP Network Monitoring Application
[Figure: network diagram of DSL/cable networks, broadband Internet access, a converged IP/MPLS core, PSTN, and enterprise networks (Voice over IP; FR, ATM, IP VPN), with a Network Operations Center (NOC) collecting SNMP/RMON and NetFlow records. An example table of NetFlow IP session data has columns Source, Destination, Duration, Bytes, Protocol (http and ftp sessions); the numeric values were lost in extraction.]

- 24x7 IP packet/flow data streams at network elements
- Truly massive streams arriving at rapid rates: AT&T/Sprint collect ~1 Terabyte of NetFlow data each day
- Often shipped off-site to a data warehouse for off-line analysis

7 Packet-Level Data Streams
- Single 2 Gb/sec link; say the average packet size is 50 bytes
- Number of packets/sec = 5 million
- Time per packet = 0.2 microseconds
- If we capture only header information per packet (src/dest IP, time, number of bytes, etc.): at least 10 bytes
- Space per second is 50 MB; space per day is ~4.5 TB, per link
- ISPs typically have hundreds of links!
- Analyzing packet content streams is order(s) of magnitude harder
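A quick back-of-envelope check of this arithmetic (a sketch; the link rate, packet size, and header size are the values assumed on the slide):

```python
# Back-of-envelope check of the slide's numbers (assumed values from the slide).
LINK_RATE_BITS_PER_SEC = 2e9   # 2 Gb/sec link
AVG_PACKET_BYTES = 50          # assumed average packet size
HEADER_BYTES = 10              # captured header bytes per packet

packets_per_sec = LINK_RATE_BITS_PER_SEC / 8 / AVG_PACKET_BYTES
time_per_packet_us = 1e6 / packets_per_sec
space_per_sec_mb = packets_per_sec * HEADER_BYTES / 1e6
space_per_day_tb = space_per_sec_mb * 86400 / 1e6

print(f"{packets_per_sec:,.0f} packets/sec")     # 5,000,000
print(f"{time_per_packet_us} microsec/packet")   # 0.2
print(f"{space_per_sec_mb:.0f} MB/sec")          # 50
print(f"{space_per_day_tb:.1f} TB/day")          # ~4.3, roughly the slide's 4.5 TB
```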

8 Network Monitoring Queries
[Figure: routers R1, R2, R3 at the edges of enterprise, PSTN, and DSL/cable networks; measurements flow to a Network Operations Center (NOC) and a back-end data warehouse DBMS (Oracle, DB2). Off-line analysis is slow and expensive.]

Example queries:
- "What are the top (most frequent) 1000 (source, dest) pairs seen over the last month?"
- SQL join query: SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source
- Set-expression query: "How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?"

9 Streaming Data Questions
Network managers ask questions that require us to analyze the data:
- How many distinct addresses are seen on the network?
- Which destinations or groups use the most bandwidth?
- Which hosts have similar usage patterns?
Extra complexity comes from limited space and time. We will introduce solutions for these and other problems.

10 Other Streaming Applications
- Sensor networks: monitor habitat and environmental parameters; track many objects, intrusions, trend analysis...
- Utility companies: monitor the power grid, customer usage patterns, etc.; alerts and rapid response in case of problems

11 Streams Defining Frequency Distributions

We will consider streams that define frequency distributions, e.g. the frequency of packets from source A to destination B. This simple setting captures many of the core algorithmic problems in data streaming:
- How many distinct (non-zero) values are seen?
- What is the entropy of the frequency distribution?
- What (and where) are the highest frequencies?
More generally, we can consider streams that define multi-dimensional distributions, graphs, geometric data, etc. But even for frequency distributions, several models are relevant.

12 Data Stream Models

We model data streams as sequences of simple tuples; complexity arises from the massive length of streams.
- Arrivals-only streams: e.g. (x, 3), (y, 2), (x, 2) encodes the arrival of 3 copies of item x, 2 copies of y, then 2 more copies of x. Could represent, e.g., packets on a network, or power usage.
- Arrivals and departures: e.g. (x, 3), (y, 2), (x, -2) encodes a final state of (x, 1), (y, 2). Can represent fluctuating quantities, or measure differences between two distributions.
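To make the tuple semantics concrete, here is a minimal sketch. Note it is not itself a streaming algorithm: it stores the whole distribution explicitly, which is exactly what the algorithms below avoid.

```python
from collections import defaultdict

def replay(stream):
    """Replay (item, count) tuples into an explicit frequency distribution.
    Negative counts model departures (arrivals-and-departures model)."""
    freq = defaultdict(int)
    for item, count in stream:
        freq[item] += count
    return dict(freq)

print(replay([("x", 3), ("y", 2), ("x", 2)]))   # arrivals only: {'x': 5, 'y': 2}
print(replay([("x", 3), ("y", 2), ("x", -2)]))  # with departures: {'x': 1, 'y': 2}
```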

13 Approximation and Randomization
Many things are hard to compute exactly over a stream. Example: is the count of every item the same in two different streams? This requires linear space to compute exactly.
- Approximation: find an answer correct within some factor, e.g. within 10% of the correct result. More generally, a (1 ± ε) factor approximation.
- Randomization: allow a small probability of failure. The answer is correct, except with probability 1 in 10,000. More generally, success probability 1 - δ.
- Combining approximation and randomization: (ε, δ)-approximations.

14 Basic Tools: Tail Inequalities
General bounds on the tail probability of a random variable (the probability that a random variable deviates far from its expectation).

Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:

  Markov:    Pr[X ≥ (1 + ε)μ] ≤ 1/(1 + ε)
  Chebyshev: Pr[|X - μ| ≥ εμ] ≤ Var[X]/(ε²μ²)

15 Tail Bounds

Markov Inequality: for a random variable Y which takes only non-negative values,

  Pr[Y ≥ k] ≤ E(Y)/k

(this is < 1 only for k > E(Y)).

Chebyshev's Inequality: for any random variable Y,

  Pr[|Y - E(Y)| ≥ k] ≤ Var(Y)/k²

Proof: set X = (Y - E(Y))². Then

  E(X) = E(Y² + E(Y)² - 2Y·E(Y)) = E(Y²) + E(Y)² - 2E(Y)² = Var(Y)

So Pr[|Y - E(Y)| ≥ k] = Pr[(Y - E(Y))² ≥ k²], and using Markov this is ≤ E[(Y - E(Y))²]/k² = Var(Y)/k².
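As a sanity check, a small simulation (a sketch; the binomial example and the trial count are my choices, not from the slides) comparing both bounds against observed tail frequencies:

```python
import random

# Y = number of heads in 100 fair coin flips: E(Y) = 50, Var(Y) = 25.
TRIALS = 100_000
samples = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(TRIALS)]

k = 15
# Markov on the event Y >= 65: Pr[Y >= 65] <= E(Y)/65
markov_bound = 50 / (50 + k)
markov_obs = sum(y >= 50 + k for y in samples) / TRIALS
# Chebyshev on the event |Y - 50| >= 15: Pr <= Var(Y)/15^2
cheb_bound = 25 / k**2
cheb_obs = sum(abs(y - 50) >= k for y in samples) / TRIALS

print(f"Markov:    bound {markov_bound:.3f} vs observed {markov_obs:.4f}")
print(f"Chebyshev: bound {cheb_bound:.3f} vs observed {cheb_obs:.4f}")
```

Both bounds hold but are loose, which is why later slides average many repetitions rather than rely on a single sample.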

16 Outline

- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy

17 Sampling From a Data Stream
- Fundamental problem: sample m items uniformly from a stream
- Useful to approximate a costly computation on a small sample
- Challenge: we don't know how long the stream is, so when/how often should we sample?
- Two solutions, which apply to different situations:
  - Reservoir sampling (dates from the 1980s?)
  - Min-wise sampling (dates from the 1990s?)

18 Reservoir Sampling

- Sample the first m items
- Choose to sample the i'th item (i > m) with probability m/i
- If sampled, randomly replace a previously sampled item
- Optimization: when i gets large, compute which item will be sampled next and skip over the intervening items [Vitter 85]
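A minimal sketch of the basic algorithm in Python (without Vitter's skip optimization):

```python
import random

def reservoir_sample(stream, m):
    """Uniformly sample m items from a stream of unknown length, in one pass."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            sample.append(item)                 # keep the first m items
        elif random.random() < m / i:           # sample the i'th item w.p. m/i
            sample[random.randrange(m)] = item  # evict a random current sample
    return sample

print(reservoir_sample(range(1_000_000), 5))
```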

19 Reservoir Sampling - Analysis
Analyze the simple case: sample size m = 1. The probability that the i'th item is the sample from a stream of length n is

  Pr[i sampled on arrival] × Pr[i survives to the end]
  = 1/i × i/(i+1) × (i+1)/(i+2) × ... × (n-2)/(n-1) × (n-1)/n
  = 1/n

The case m > 1 is similar, and it is easy to show uniform probability. Drawback of reservoir sampling: it is hard to parallelize.

20 Min-wise Sampling

- For each item, pick a random fraction (tag) between 0 and 1
- Store the item(s) with the smallest random tag [Nath et al. '04]

Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273; the item tagged 0.273 is kept.

- Each item has the same chance of having the least tag, so the sample is uniform
- Can run on multiple streams separately, then merge
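A sketch of min-wise sampling for a single item, including the merge across streams that reservoir sampling lacks:

```python
import random

def minwise_sample(stream):
    """Return (tag, item) for the item with the smallest random tag."""
    best = (float("inf"), None)
    for item in stream:
        best = min(best, (random.random(), item))
    return best

def merge(*partials):
    """Combine min-wise samples of separate streams: the global min tag wins."""
    return min(partials)

s1 = minwise_sample(["a", "b", "c"])
s2 = minwise_sample(["d", "e", "f"])
print(merge(s1, s2))  # a uniform sample of the union of both streams
```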

21 Sampling Exercises

What happens when each item in the stream also has a weight attached, and we want to sample based on these weights?
- Generalize the reservoir sampling algorithm to draw a single sample in the weighted case.
- Generalize reservoir sampling to sample multiple weighted items, and show an example where it fails to give a meaningful answer.
- Research problem: design new streaming algorithms for sampling in the weighted case, and analyze their properties.

22 Outline

- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy

23 Application of Sampling: Entropy
Given a long sequence of characters S = <a1, a2, a3, ..., am>, each aj ∈ {1...n}. Let fi = the frequency of i in the sequence. Compute the empirical entropy:

  H(S) = -Σi (fi/m) log(fi/m) = -Σi pi log pi

Example: S = <a, b, a, b, c, a, d, a>
  pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8
  H(S) = 1/2 × 1 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 7/4

Entropy has been promoted for anomaly detection in networks.
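For reference, the exact computation (a sketch; it needs the full O(n) count space that the streaming algorithm below avoids):

```python
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """H(S) = -sum_i (f_i/m) log2(f_i/m), computed exactly."""
    m = len(seq)
    return -sum((f / m) * log2(f / m) for f in Counter(seq).values())

print(empirical_entropy("ababcada"))  # 1.75 = 7/4, matching the slide's example
```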

24 Challenge

Goal: approximate H(S) in space sublinear (poly-log) in m (stream length) and n (alphabet size). An (ε, δ) approximation: the answer is (1 ± ε)H(S) with probability 1 - δ.
- Easy if we have O(n) space: compute each fi exactly
- More challenging if n is huge, m is huge, and we have only one pass over the input in order (the data stream model)

25 Sampling Based Algorithm
Simple estimator:
- Randomly sample a position j in the stream
- Count how many times aj appears subsequently; call this r
- Output X = -(r log(r/m) - (r-1) log((r-1)/m))

Claim: the estimator is unbiased, i.e. E[X] = H(S).
Proof: the probability of picking each j is 1/m, and the sum telescopes correctly.

The variance of the estimate is not too large: Var[X] = O(log² m).
Observe that |X| ≤ log m, so Var[X] = E[(X - E[X])²] ≤ (max(X) - min(X))² = O(log² m).
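A sketch of this estimator in Python. To pick a uniform position without knowing m in advance, this version reuses min-wise sampling (an implementation choice of mine, not something this slide specifies); averaging many repetitions is the variance-reduction step of the next slide.

```python
import random
from math import log2

def basic_estimator(stream):
    """One draw of the unbiased entropy estimator X."""
    m, best_tag, sampled, r = 0, float("inf"), None, 0
    for item in stream:
        m += 1
        tag = random.random()
        if tag < best_tag:            # min-tag position is uniform over 1..m
            best_tag, sampled, r = tag, item, 1
        elif item == sampled:
            r += 1                    # r = occurrences of a_j from position j onward
    x = r * log2(m / r)               # X = -(r log(r/m) - (r-1) log((r-1)/m))
    if r > 1:
        x -= (r - 1) * log2(m / (r - 1))
    return x

stream = list("ababcada") * 100                  # H(S) = 7/4
estimates = [basic_estimator(stream) for _ in range(5000)]
print(sum(estimates) / len(estimates))           # close to 1.75
```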

26 Analysis of Basic Estimator
A general technique in data streams: run an unbiased estimator with bounded variance k times in parallel, and take the average of the estimates to reduce the variance:

  Var[(1/k)(Y1 + Y2 + ... + Yk)] = Var[Y]/k

By Chebyshev, the number of repetitions needed is k = O(Var[X]/(ε² E²[X])). For entropy, this means space O(log² m/(ε² H²(S))).

Problem for entropy: what happens when H(S) is very small? The space needed for an accurate approximation grows as 1/H²!
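Spelling out the Chebyshev step (with a constant failure probability of 1/3, a conventional choice not stated on the slide), where Ȳ is the average of k independent copies of X:

```latex
\Pr\bigl[\,|\bar{Y} - H(S)| \ge \varepsilon H(S)\,\bigr]
  \le \frac{\mathrm{Var}[\bar{Y}]}{\varepsilon^{2} H^{2}(S)}
  = \frac{\mathrm{Var}[X]}{k\,\varepsilon^{2} H^{2}(S)}
  \le \frac{1}{3}
  \quad\text{once}\quad
  k \ge \frac{3\,\mathrm{Var}[X]}{\varepsilon^{2} H^{2}(S)}
  = O\!\left(\frac{\log^{2} m}{\varepsilon^{2} H^{2}(S)}\right).
```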

27 Low Entropy

But... what does a low-entropy stream look like?

  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa

Very boring most of the time; we are only rarely surprised. Can there be two frequent items?

  aabababababababaababababbababababababa

No! That's high entropy (≈ 1 bit/character). The only way to get H(S) = o(1) is to have only one character with pi close to 1.

28 Removing the frequent character
Write the entropy as -pa log pa + (1 - pa) H(S'), where S' = the stream S with all 'a's removed.

Can show:
- It doesn't matter if H(S') is small: as pa is large, additive error on H(S') ensures relative error on (1 - pa)H(S')
- Error ε(1 - pa) on pa gives relative error on pa log pa
- Summing both (positive) terms gives relative error overall

29 Data Stream Algorithms - Introduction
Finding the boring guy Ejecting a is easy if we know in advance what it is Can then compute pa exactly Can find online deterministically Assume pa > 2/3 (if not, H(S) > 0.9, and original alg works) Run a ‘heavy hitters’ algorithm on the stream (see later) Modify analysis, find a and pa §  (1-pa) But... how to also compute H(S’) simultaneously if we don’t know a from the start... do we need two passes? Data Stream Algorithms - Introduction

30 Always have a backup plan...

Idea: keep two samples to build our estimator. If, at the end, one of our samples is 'a', use the other.

How can we do this and still ensure uniform sampling?
- Pick the first sample with min-wise sampling
- At the end of the stream, if the sampled character = 'a', we want a sample from the stream ignoring all 'a's
- This is just "the character achieving the smallest label distinct from the one that achieves the smallest label"
- We can track the information needed to do this in a single pass, in constant space

31 Sampling Two Tokens

- Assign tags and choose the first token as before
- Delete all occurrences of the first token
- Choose the token with the minimum remaining tag; count its repeats

[Worked figure: a stream of tokens (C, A, A, B, ...) with random tags (0.408, 0.815, 0.217, 0.191, 0.770, 0.082, 0.366, 0.228, 0.549, 0.173, 0.627, 0.202). The minimum tag falls on an A. The second-smallest tag falls on another A, the same token as the min tag, which we don't want; the token with the minimum tag amongst the remaining tokens is a B.]

Implementation: keep track of two triples (min tag, corresponding token, number of repeats).
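A sketch of the constant-space implementation. The key observation (mine, though implicit in the slide) is that when a new overall minimum arrives with a different token, the old overall minimum automatically has the smallest tag among tokens distinct from the new one:

```python
import random

def sample_two_tokens(stream):
    """One pass, constant space. Returns:
    first  = (tag, token, repeats) for the smallest tag overall;
    second = the same for the smallest tag among tokens != first's token,
             i.e. a uniform sample of the stream with first's token deleted."""
    first = second = None
    for token in stream:
        tag = random.random()
        if first and token == first[1]:
            first = (first[0], first[1], first[2] + 1)    # count a repeat
        if second and token == second[1]:
            second = (second[0], second[1], second[2] + 1)
        if first is None or tag < first[0]:
            if first and token != first[1]:
                second = first  # old min is best among tokens != new min's token
            first = (tag, token, 1)
        elif token != first[1] and (second is None or tag < second[0]):
            second = (tag, token, 1)
    return first, second

first, second = sample_two_tokens("aabababaaacaaabaaa")
print(first, second)  # if first sampled the frequent 'a', fall back on second
```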

32 Putting it all together
We can combine all these pieces: build an estimator based on tracking this information, deciding whether or not there is a frequent character.

A more involved Chernoff-bound argument improves the number of repetitions of the estimator from O(ε⁻² Var[X]/E²[X]) to O(ε⁻² Range[X]/E[X]) = O(ε⁻² log m).

Result: in O(ε⁻² log m log 1/δ) space we can compute an (ε, δ)-approximation to H(S) in a single pass.

33 Entropy Exercises

As a subroutine, we need to find an element that occurs more than 2/3 of the time and estimate its weight:
- How can we find such a frequently occurring item? How can we estimate its weight p with ε(1 - p) error?
- Our algorithm uses O(ε⁻² log m log 1/δ) space; could this be improved, or is it optimal (lower bounds)?
- Our algorithm updates each sampled pair for every update; how quickly can we implement it?
- (Research problem) What if there are multiple distributed streams and we want to compute the entropy of their union?

34 Outline

- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy

