Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.



Data Streams Applications
• Stream item counting
• Stream statistics
• Stream classification
• Stream matching

What are Data Streams?
• Data streams
  ◦ Data streams: continuous, ordered, changing, fast, huge volumes of data
  ◦ Traditional DBMS: data stored in finite, persistent data sets
• Characteristics
  ◦ Huge volumes of continuous data, possibly infinite
  ◦ Fast changing; requires fast, real-time response
  ◦ Random access is expensive, so algorithms make a single scan (one pass) over the data
  ◦ Store only a summary of the data seen thus far
  ◦ Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
Ack.: Jiawei Han

Examples
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even when stored, random access is too expensive)
Ack.: Jiawei Han

DBMS versus DSMS

DBMS                                          | DSMS
Persistent relations                          | Transient streams
One-time queries                              | Continuous queries
Random access                                 | Sequential access
“Unbounded” disk store                        | Bounded main memory
Only current state matters                    | Historical data is important
No real-time services                         | Real-time requirements
Relatively low update rate                    | Possibly multi-GB arrival rate
Data at any granularity                       | Data at fine granularity
Assume precise data                           | Data stale/imprecise
Access plan determined by query processor,    | Unpredictable/variable data arrival
physical DB design                            | and characteristics

Ack.: Motwani’s PODS tutorial slides

In General: Streaming Algorithm
[Figure: continuous data stream X_1, X_2, …, X_n (terabytes) → stream processing engine (hashing) → summary in memory (gigabytes), an “indirect” observation of the stream; query Q returns an estimate of θ, where θ = g(X_1, …, X_n).]

Querying
• Query types
  ◦ One-time query vs. continuous query (evaluated continuously as the stream arrives)
  ◦ Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  ◦ For real-time response, main-memory algorithms should be used
  ◦ The memory requirement is unbounded if a query must join with future tuples
• Approximate query answering
  ◦ With bounded memory, it is not always possible to produce exact answers
  ◦ High-quality approximate answers are desired
  ◦ Data reduction and synopsis construction methods
    ▪ Sketches, random sampling, histograms, wavelets, etc.
Ack.: Jiawei Han

Synopses/Approximate Answers
• Major challenges
  ◦ Keep track of a large universe, e.g., pairs of IP addresses, not ages
• Methodology
  ◦ Synopses (trade-off between accuracy and storage): a summary given in brief terms that covers the major points of a subject matter
  ◦ Use a synopsis data structure that is much smaller (O(log^k N) space) than its base data set (O(N) space)
  ◦ Compute an approximate answer within a small error range (factor ε of the actual answer)
• Major methods
  ◦ Random sampling
  ◦ Histograms
  ◦ Sliding windows
  ◦ Multi-resolution model
  ◦ Sketches
  ◦ Randomized algorithms
Ack.: Jiawei Han

Types of Streaming Algorithms
• Sliding windows
  ◦ Compute only over sliding windows of recent stream data
  ◦ An approximation, but often more desirable in applications
• Batched processing, sampling and synopses
  ◦ Batched if update is fast but computing is slow
    ▪ Compute periodically, not very timely
  ◦ Sampling if update is slow but computing is fast
    ▪ Compute using sample data
  ◦ Synopsis data structures
    ▪ Maintain a small synopsis or sketch of data
    ▪ Good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  ◦ Blocking if unable to produce the first output until seeing the entire input
Ack.: Jiawei Han

Stream Processing
• Random sampling (but without knowing the total length in advance)
• Sliding windows
  ◦ Make decisions based only on recent data within a sliding window of size w
  ◦ An element arriving at time t expires at time t + w
• Histograms
  ◦ Approximate the frequency distribution of element values in a stream
  ◦ Partition data into a set of contiguous buckets
  ◦ Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
• Multi-resolution models
  ◦ Popular models: balanced binary trees, micro-clusters, and wavelets
Ack.: Jiawei Han
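Random sampling without knowing the stream length in advance is commonly done with reservoir sampling. The slide does not name a specific algorithm, so the following is a minimal sketch of the classic single-pass version; the function name and the seeded generator are illustrative choices.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable
    whose length is not known in advance (single pass, O(k) memory)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(iter(range(10_000)), k=5, rng=random.Random(0))
```

Each item seen so far stays in the reservoir with equal probability, so the sample is uniform at every point in the stream.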

Random Sampling: A Simple Approach to Item Counts

Random Sampling: Packet Sampling
[Figure: a router on an Internet path applies Bernoulli sampling to passing packets and exports a traffic summary.]
• Widely used: processing overhead is controlled by the sampling rate (1/200)
• Estimates packet-level statistics, e.g., find the % of traffic from Purdue
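Bernoulli packet sampling and the resulting packet-level estimate can be sketched as follows. The packet representation (a `(source, size)` pair), the synthetic traffic mix, and the function names are assumptions made for illustration only.

```python
import random

def bernoulli_sample(packets, rate, rng):
    """Keep each packet independently with probability `rate`."""
    return [p for p in packets if rng.random() < rate]

def estimate_byte_share(sampled, source):
    """Estimate the fraction of bytes sent by `source` from sampled packets.
    Returns None when the sample is empty."""
    total = sum(size for _, size in sampled)
    if total == 0:
        return None
    return sum(size for src, size in sampled if src == source) / total

rng = random.Random(1)
# Synthetic traffic: roughly 30% of packets come from "purdue" (made-up data).
packets = [("purdue" if rng.random() < 0.3 else "other", rng.choice([40, 576, 1500]))
           for _ in range(200_000)]
sampled = bernoulli_sample(packets, rate=1 / 200, rng=rng)
share = estimate_byte_share(sampled, "purdue")
exact = estimate_byte_share(packets, "purdue")
```

Because every packet is kept with the same probability, ratios such as the traffic share need no rescaling; totals would be scaled by 1/rate.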

A Fair Measure: Flow-level Statistics
• Find % connections from Purdue
• Estimate flow-level statistics, e.g., estimate the flow size distribution

Flow-level Statistics from Sampled Packets?
• Reverse problem (inference problem)

Finding Estimates: Schematic View
[Figure: schematic with stages labeled Sampling and Estimator.]

Flow Size Distribution: Maximum Likelihood Estimation
• Sampling rate = 1/200
• 128,000 sampled flows
• EM algorithm
  ◦ 2 initializations
• Estimates highly sensitive to initialization

MLE: More Samples
• Packet sampling rate = 1/200, 1 trillion sampled flows

Problem: Uniform Sampling
• Surface: 71% is water [Image: Wikipedia]

Importance Sampling
• Dedicates precious memory only to “important” observations
• Sample flows, rather than packets
  ◦ Problem?
  ◦ Will likely miss large flows
• Sample flows ∝ flow size
  ◦ Problem?
  ◦ Streaming setting: we don’t yet know the size
• Example of compromise: Sample and Hold
  ◦ Sample packets, keep all remaining packets of the same flow

Different Sampling Designs
• Packet Sampling (PS): sample elements with probability p
• Flow Sampling (FS): sample sets (flows) with probability q
• Sample & Hold (SH): randomly sample elements with probability q’ from the stream, but collect all future elements with the same color (i.e., of the same flow)
• Dual Sampling (DS): sample the first element with high probability; sample the following elements with low probability and use “sequence numbers” to recover elements lost “in the middle”
[Figure: flows seen as a stream of elements.]
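The Sample & Hold design can be sketched in a few lines. This is a simplified illustration that treats the stream as a sequence of flow identifiers; the function name and the dictionary-based table are assumptions, not the implementation evaluated in the cited results.

```python
import random

def sample_and_hold(packets, q, rng):
    """packets: iterable of flow ids, one per packet.
    A packet of an untracked flow is sampled with probability q; once a flow
    is tracked, every later packet of that flow is counted ("held")."""
    held = {}
    for flow in packets:
        if flow in held:
            held[flow] += 1          # hold: count all remaining packets
        elif rng.random() < q:
            held[flow] = 1           # sample: start tracking this flow
    return held

stream = [1, 1, 2, 1, 3, 2]
exact = sample_and_hold(stream, q=1.0, rng=random.Random(0))
partial = sample_and_hold(stream, q=0.5, rng=random.Random(0))
```

Large flows offer many chances to be sampled, so they are likely to be tracked early, while each tracked count can only underestimate the true flow size.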

Results: Different Sampling Designs
[Figure: comparison of sampling designs; source: Tune & Veitch, 2014.]
• FS = Flow sampling
• SH = Sample and hold
• DS = Dual sampling
• PS = Packet sampling

Sketches

• Note that not every problem can be solved well with sampling
  ◦ Example: flow size estimation
• “Sketch”: a linear transformation of the input
  ◦ Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix
[Figure: stream X_1, X_2, …, X_n → linear projection → sketch.]
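The linearity of a sketch can be illustrated with a hash-indexed counter array, which corresponds to multiplying the stream's frequency vector by an implicit 0/1 matrix with a single 1 per column. This is a sketch under assumptions: the SHA-256-based bucket function and the width of 8 are arbitrary illustrative choices.

```python
import hashlib

def bucket(key, width):
    """Map a key to one of `width` counters (stable across runs)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % width

def sketch(stream, width):
    """Counter-array sketch: counters[j] = sum of frequencies of keys hashed to j."""
    counters = [0] * width
    for key in stream:
        counters[bucket(key, width)] += 1
    return counters

a = ["f1", "f2", "f1", "f3"]
b = ["f2", "f4"]
# Linearity: the sketch of the concatenated stream is the sum of the sketches.
merged = [x + y for x, y in zip(sketch(a, 8), sketch(b, 8))]
```

Linearity is what lets sketches from separate routers or time intervals be combined by simple addition.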

• Definitions
  ◦ N → number of flows
  ◦ W → maximum flow size
  ◦ M → memory size
• Space complexity
  ◦ Available memory M = k N log W, k < 1

Counters
• Motivation: estimate the flow size distribution
• Hash function f: uniformly at random associates a newly arrived flow with a counter
• Uses precious memory with counters > 0
[Figure: elements of the blue, red, and green flows are mapped by hash function f to counters; flows mapped to the same counter collide. The counter values feed the flow size distribution estimate.]
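The counter array and its collisions can be illustrated as follows. This is a minimal sketch: the hash function, the number of counters, and the per-counter sets of flow ids (kept only to expose collisions, which a real sketch could not afford to store) are assumptions for illustration.

```python
import hashlib
from collections import Counter

def counter_index(flow_id, num_counters):
    """Stable stand-in for a uniform hash function f."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_counters

def update_counters(packets, num_counters):
    counters = [0] * num_counters
    owners = [set() for _ in range(num_counters)]  # ground truth, illustration only
    for flow_id in packets:
        i = counter_index(flow_id, num_counters)
        counters[i] += 1
        owners[i].add(flow_id)
    return counters, owners

packets = ["blue"] * 3 + ["red"] * 2 + ["green"] * 1
counters, owners = update_counters(packets, num_counters=4)
# Observed distribution of nonzero counter values; it matches the true flow
# size distribution only where no two flows share a counter.
observed = Counter(c for c in counters if c > 0)
collided = sum(1 for flows in owners if len(flows) > 1)
```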

Data Streaming on Flow Size Estimation
[Figure: Sketch phase: at the router, a universal hash function maps each packet's flow to a counter (counters start at 0), and collisions can occur; the counters form the summary. Estimation phase: a powerful back-end server disambiguates the counters and produces the flow size distribution estimate.]

Issues with Kumar et al.
• Effectively only works if counter load < 2
• In practice reduces required memory by 1/2
• Very resource-intensive estimation procedure

Eviction Sketch (Ribeiro et al.)

Eviction Sketch: Probabilistic Collision Avoidance
• Maximum hash value = M
• M/2 counters
• If hash(packet) < M/2 → red
• Otherwise → blue, stored at counter (hash(packet) mod M/2)
[Figure: flows 7, 8, and 9 hashed into M/2 counters. A collision between flows of the same color is undetectable; a blue-red collision is detectable, at the cost of 1 bit.]

Eviction
• Number of eviction classes: ∞
• Policy: evicts random flow (flow sampling)
Folding: interesting fact
• Collision policy:
  ◦ “red flow cannot increment blue counter”
  ◦ “blue flow overwrites red counter”
  ◦ counters = 0 are red
• Result, e.g., with 1 counter per flow:
  ◦ All red counters are also blue counters = 0
  ◦ Virtually expands the hash table by ≈ 50% (virtual 2 counters per flow)
  ◦ Blue counters evict red counters
  ◦ Flow sampling effect: discards 15% of flows at random
[Figure: flows, counters, and counter colors (extra bit).]
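One possible reading of the red/blue collision policy can be sketched as below. This is an interpretation of the slide text, not the authors' implementation: the class name, the pluggable `hash_fn`, and the choice to restart an overwritten counter at 1 are all assumptions.

```python
import hashlib

RED, BLUE = 0, 1

def flow_hash(flow_id, max_hash):
    """Stable stand-in for a hash function with values in [0, max_hash)."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % max_hash

class EvictionCounters:
    def __init__(self, max_hash, hash_fn=flow_hash):
        if max_hash % 2:
            raise ValueError("max_hash must be even")
        self.max_hash, self.half, self.hash_fn = max_hash, max_hash // 2, hash_fn
        self.count = [0] * self.half
        self.color = [RED] * self.half   # counters equal to 0 are red

    def locate(self, flow_id):
        h = self.hash_fn(flow_id, self.max_hash)
        return h % self.half, (RED if h < self.half else BLUE)

    def add_packet(self, flow_id):
        i, color = self.locate(flow_id)
        if color == RED:
            if self.color[i] == RED:     # red flow cannot increment a blue counter
                self.count[i] += 1
        elif self.color[i] == BLUE:
            self.count[i] += 1
        else:                            # blue flow overwrites a red counter
            self.color[i], self.count[i] = BLUE, 1

# Deterministic demo: use the flow id itself as the hash value (M = 8).
table = EvictionCounters(8, hash_fn=lambda flow_id, max_hash: flow_id % max_hash)
for flow in [1, 1, 5, 1, 5]:   # flow 1 is red, flow 5 is blue, both map to counter 1
    table.add_packet(flow)
```

In the demo, the blue flow evicts the red one from the shared counter, and the red flow's later packets are ignored.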

Reduce Counter Size: Probabilistic Counter Increments
• With m_a = 2^a, a 6-bit counter bins flows up to average size …
[Figure: arriving packets update a hash counter; the counter moves from k to k+1 with probability p = 1/m_1 per packet and from k+1 to k+2 with probability p = 1/m_2.]
• Counter value k → average flow sizes in [k, k + m_1 − 1]
• Counter value k+1 → average flow sizes in [k + m_1, k + m_1 + m_2 − 1]
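The probabilistic increment rule can be illustrated with a small simulation. This is a sketch under stated assumptions: counting is exact below a threshold `k`, the bin widths are m_a = 2^a starting at counter value `k`, and the 6-bit cap of 63 is hard-coded; the function names are hypothetical.

```python
import random

def width(value, k):
    """Assumed bin width: 1 below k (exact counting), then m_a = 2**a."""
    return 1 if value < k else 2 ** (value - k + 1)

def increment(value, k, rng, max_value=63):
    """Process one packet: advance the counter with probability 1 / width."""
    if value >= max_value:
        return value
    return value + 1 if rng.random() < 1.0 / width(value, k) else value

def size_range(value, k):
    """Flow sizes represented by a counter value, as an inclusive range."""
    lo = value if value <= k else k + 2 ** (value - k + 1) - 2
    return lo, lo + width(value, k) - 1

rng = random.Random(0)
small = 0
for _ in range(5):            # 5 packets: still in the exact region for k = 16
    small = increment(small, 16, rng)
large = 0
for _ in range(10_000):       # a large flow ends up in a wide bin
    large = increment(large, 16, rng)
```

With `k = 16`, counter value 16 covers sizes 16 to 17 and value 17 covers 18 to 21, which matches the ranges [k, k + m_1 − 1] and [k + m_1, k + m_1 + m_2 − 1] for m_1 = 2 and m_2 = 4.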

Experiment  Evaluated with simulations  Our worst result with Internet core traces ◦ 9.5 million flows ◦ 8MB of memory ◦ k=16 ◦ W=10 14 k Same accuracy without counter folding requires 13MB of memory © 2015 Bruno Ribeiro

Input: 10^6 flows with 250KB memory

Good tutorial: Andrei Broder and Michael Mitzenmacher, “Network Applications of Bloom Filters: A Survey,” Internet Mathematics, Vol. 1, No. 4, 2003

How Bloom Filters Work
[Figure: an item is mapped by hash functions f_1, f_2, f_3 to positions in a bit array.]

Why Bloom Filters Work
• S = set of items
• n = |S|
• k = number of hash functions
• m = number of bits in the filter
• Assume kn < m
• To check membership of y in S, check whether the bits at positions f_i(y), 1 ≤ i ≤ k, are all set to 1
  ◦ If not, y ∉ S
  ◦ Else, we conclude that y ∈ S, but sometimes y ∉ S (false positive)
• In many applications, false positives are OK as long as they happen with small probability
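The insert and membership-check steps can be sketched as follows. This is a minimal illustration, not a production filter: the hash functions are derived from SHA-256 with an index prefix, and one byte is used per bit for readability; both are assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Simulate independent hash functions f_1..f_k by salting with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=800, num_hashes=6)
members = [f"item-{i}" for i in range(100)]
for item in members:
    bf.add(item)
```

A lookup that finds any 0 bit proves the item was never inserted, which is why the filter has no false negatives.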

Bloom Filter Errors
• Assumption: hash functions look random
• Given m bits for the filter and n elements, choose the number k of hash functions to minimize false positives:
  ◦ Let $p = \Pr[\text{a given bit is } 0] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
  ◦ Then $f = \Pr[\text{false positive}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
• As k increases, there are more chances to find at least one 0, but we also insert more 1’s in the bit vector
• Optimal at k = (ln 2) m/n (derivative = 0, 2nd derivative > 0)
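The optimum stated above can be checked with a short derivation. It uses the standard approximation for the probability that a bit is still 0, with m bits, n elements, and k hash functions as on this slide; the symbols p, f, and g are introduced here for the derivation.

```latex
\begin{align*}
p &= \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m}, \\
f(k) &= (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k
      = \exp\!\big(g(k)\big), \qquad g(k) = k \ln\!\left(1 - e^{-kn/m}\right), \\
g'(k) &= \ln\!\left(1 - e^{-kn/m}\right)
      + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}} .
\end{align*}
Setting $e^{-kn/m} = \tfrac{1}{2}$ gives $g'(k) = -\ln 2 + \ln 2 = 0$, so
\begin{align*}
k^{*} = \frac{m}{n} \ln 2, \qquad
f(k^{*}) = \left(\tfrac{1}{2}\right)^{k^{*}} \approx (0.6185)^{m/n} .
\end{align*}
```

At the optimum, half of the bits in the filter are 0 and half are 1.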

Example
• m/n = 8
• Optimal k = 8 ln 2 ≈ 5.55
Ack.: Mitzenmacher
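The example can be reproduced numerically. The following sketch evaluates the approximate false positive rate (1 − e^(−k/r))^k for r = 8 bits per stored element (m/n on the slide); the function names are illustrative.

```python
import math

def false_positive_rate(bits_per_item, num_hashes):
    """Approximate false positive rate (1 - e^(-k / r))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes

def optimal_num_hashes(bits_per_item):
    """Real-valued optimum k = r * ln 2."""
    return bits_per_item * math.log(2)

ratio = 8
k_opt = optimal_num_hashes(ratio)
rates = {k: false_positive_rate(ratio, k) for k in range(1, 11)}
best_integer_k = min(rates, key=rates.get)

for k, rate in rates.items():
    print(f"k={k:2d}  false positive rate ~ {rate:.4f}")
```

Since k must be an integer in practice, the neighboring integers 5 and 6 are the candidates; their rates differ only slightly, so the smaller k may be preferred to save hashing work.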