Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.

 Data Streams ◦ Data streams — continuous, ordered, changing, fast, huge amount data sets ◦ Traditional DBMS — data stored in finite, persistent data sets  Characteristics ◦ Huge volumes of continuous data, possibly infinite ◦ Fast changing and requires fast, real-time response ◦ Random access is expensive — single scan algorithm (only single pass) ◦ Store only the summary of the data seen thus far ◦ Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing 3 What are Data Streams? Ack. From Jiawei Han

 Telecommunication calling records  Business: credit card transaction flows  Network monitoring and traffic engineering  Financial market: stock exchange  Engineering & industrial processes: power supply & manufacturing  Sensor, monitoring & surveillance: video streams, RFIDs  Security monitoring  Web logs and Web page click streams  Massive data sets (even saved but random access is too expensive) 4 Examples Ack. From Jiawei Han

5 DBMS versus DSMS  Persistent relations  One-time queries  Random access  “Unbounded” disk store  Only current state matters  No real-time services  Relatively low update rate  Data at any granularity  Assume precise data  Access plan determined by query processor, physical DB design  Transient streams  Continuous queries  Sequential access  Bounded main memory  Historical data is important  Real-time requirements  Possibly multi-GB arrival rate  Data at fine granularity  Data stale/imprecise  Unpredictable/variable data arrival and characteristics Ack. From Motwani’s PODS tutorial slides

6 In General: Streaming algorithm X1X1 stream processing engine estimate of θ, summary (in memory) Continuous Data Stream (Terabytes) (Gigabytes) XnXn X2X2 where θ = g(X 1,...,X n ) “indirect” observation Query Q … Hashing © 2015 Bruno Ribeiro

 Query types ◦ One-time query vs. continuous query (being evaluated continuously as stream continues to arrive) ◦ Predefined query vs. ad-hoc query (issued on-line)  Unbounded memory requirements ◦ For real-time response, main memory algorithm should be used ◦ Memory requirement is unbounded if one will join future tuples  Approximate query answering ◦ With bounded memory, it is not always possible to produce exact answers ◦ High-quality approximate answers are desired ◦ Data reduction and synopsis construction methods  Sketches, random sampling, histograms, wavelets, etc. 7 Querying Ack. From Jiawei Han

 Major challenges ◦ Keep track of a large universe, e.g., pairs of IP address, not ages  Methodology ◦ Synopses (trade-off between accuracy and storage): A summary given in brief terms that covers the major points of a subject matter ◦ Use synopsis data structure, much smaller (O(log k N) space) than their base data set (O(N) space) ◦ Compute an approximate answer within a small error range (factor ε of the actual answer)  Major methods ◦ Random sampling ◦ Histograms ◦ Sliding windows ◦ Multi-resolution model ◦ Sketches ◦ Radomized algorithms 8 Synopses/Approximate Answers Ack. From Jiawei Han

 Sliding windows ◦ Only over sliding windows of recent stream data ◦ Approximation but often more desirable in applications  Batched processing, sampling and synopses ◦ Batched if update is fast but computing is slow  Compute periodically, not very timely ◦ Sampling if update is slow but computing is fast  Compute using sample data ◦ Synopsis data structures  Maintain a small synopsis or sketch of data  Good for querying historical data  Blocking operators, e.g., sorting, avg, min, etc. ◦ Blocking if unable to produce the first output until seeing the entire input 9 Types of Streaming Algorihms Ack. From Jiawei Han

 Random sampling (but without knowing the total length in advance)  Sliding windows ◦ Make decisions based only on recent data of sliding window size w ◦ An element arriving at time t expires at time t + w  Histograms ◦ Approximate the frequency distribution of element values in a stream ◦ Partition data into a set of contiguous buckets ◦ Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)  Multi-resolution models ◦ Popular models: balanced binary trees, micro-clusters, and wavelets 10 Stream Processing Ack. From Jiawei Han

12 Random Sampling: Packet sampling Router Internet Bernoulli sampling Internet Widely used: processing overhead controlled by sampling rate (1/200) Traffic summary: * Find % traffic from Netflix @ Purdue Estimate packet-level statistics >> © 2015 Bruno Ribeiro

16 Flow size distribution: maximum likelihood estimation  sampling rate = 1/200  128,000 sampled flows  EM algorithm ◦ 2 initializations Estimates highly sensitive to initialization © 2015 Bruno Ribeiro

 Dedicates precious memory only to “important” observations  Sample flows, rather than packets ◦ Problem? ◦ Will likely miss large flows  Sample flows ∝ flow size ◦ Problem? ◦ Streaming setting: We don’t yet know the size  Example pf compromise: Sample and Hold ◦ Sample packets, keep all remaining packets of same flow 19 Importance Sampling © 2015 Bruno Ribeiro

Different Sampling Designs  Packet Sampling = Packet Sampling: Sample elements with probability p  Flow Sampling = Flow sampling: Sample sets with probability q  Sample & Hold = Randomly sample elements with probability q’ from the stream but collect all future elements with same color  Dual Sampling = Sample first element with high probability. Sample following elements with low probability and use “sequence numbers” to obtain elements lost “in the middle” moca seeing as a stream of elements jg © 2015 Bruno Ribeiro

 Note that not every problem can be solved well with sampling ◦Example: flow size estimation  “Sketch”:  “Sketch”: a linear transformation of the input ◦Model stream as defining a vector, sketch is result of multiplying stream vector by an (implicit) matrix linear projection stream sketch X1X1 XnXn X2X2 … © 2015 Bruno Ribeiro

25 Counters elements of flow blue elements of flow red elements of flow green f f f collision Flow size distribution Motivation: ❍ Estimate flow size distribution Hash function f Uses precious memory with counters > 0 Hash function: Uniformly at random associates a newly arrived flow to a counter © 2015 Bruno Ribeiro

26 0 Data Streaming on Flow Size Estimation router Estimation phase powerful back end server powerful back end server 0 0 universal hash function 1 12 0 0 Sketch phase 1 2 collision!! counters summary flow size distribution estimate Disambiguate © 2015 Bruno Ribeiro

Eviction Sketch: Probabilistic collision avoidance 201600612 Flows: flow 7 flow 8  Maximum hash value = M  M/2 counters  If hash(packet) < M/2 → red  Otherwise (hash(packet) mod M/2) → blue flow 9 Counters: M/2 counters Undetectable collision Detectable blue – red collision: 1 bit required © 2015 Bruno Ribeiro

Eviction Number of eviction classes ∞ Policy: Evicts random flow Flow sampling Folding: interesting fact Collision policy:  “red flow cannot increment blue counter”  “blue flow overwrites red counter”  counter = 0 are red Result: e.g. if 1 counter / flow  All red counters are also blue counters = 0  Virtually expands hash table in ≈ 50% (virtual 2 counters/ flow)  Blue counters evict red counters  Flow sampling effect: Discards 15% flows at random Result: e.g. if 1 counter / flow  All red counters are also blue counters = 0  Virtually expands hash table in ≈ 50% (virtual 2 counters/ flow)  Blue counters evict red counters  Flow sampling effect: Discards 15% flows at random 201300612 Flows: Counters: 000100111 Counter colors: (extra bit) © 2015 Bruno Ribeiro

Reduce counter size: Probabilisitc counter increments With m a = 2 ª, 6 bit counter bins up flows up to average size 10 14 0 0 1 1 Arrived packets: … k-1 2 2 k k k+1 p=1/m 1 k+2 … m1m1 … m2m2 p=1/m 2 k average Hash counter  Counter value k → average flow sizes = [k, k+m 1 -1]  Counter value k+1 → average flow sizes = [k+m 1, k+m 1 +m 2 -1] © 2015 Bruno Ribeiro

Experiment  Evaluated with simulations  Our worst result with Internet core traces ◦ 9.5 million flows ◦ 8MB of memory ◦ k=16 ◦ W=10 14 k Same accuracy without counter folding requires 13MB of memory © 2015 Bruno Ribeiro

 S = set of items  m = |S|  k = hash functions  n = number of stored bits in filter  Assume kn < m  To check membership: y ∊ S, check whether f i ( y ), 1≤i≤k, are all set to 1 o If not, y ∉ S o Else, we conclude that y ∊ S, but sometimes y ∉ S (false positive)  In many applications, false positives are OK as long as happens with small probability Why Bloom Filters Work © 2015 Bruno Ribeiro

Bloom Filter Errors  Assumption: Hash functions look random  Given m bits for filter and n elements, choose number k of hash functions to minimize false positives: ◦ Let ◦ Then,  As k increases, more chances to find at least one 0 but we also insert more 1 ’s in bit vector  Optimal at k = (ln 2)m/n (derivative = 0, 2nd deriv > 0) © 2015 Bruno Ribeiro

Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.

Similar presentations

Presentation on theme: "Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.

Similar presentations

Presentation on theme: "Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro."— Presentation transcript:

Similar presentations

About project

Feedback