CS591A1 Fall 2002: Sketch based Summarization of Data Streams. Manish R. Sharma and Weichao Ma.



Motivation
- Data arrives in streams, item by item
- Summarize data streams in one pass to answer point or range queries on streams
  - Backtracking is usually not possible
- Processing time should be small
- Storage requirements should be minimal

Example
- Telephone call records arriving item by item [example records not preserved]
- Need: [not preserved]
- Only one pass allowed

Obstacles
- The arrival model is not known in advance
- Processing time usually cannot match the speed at which items arrive
- Storage is limited: it is not possible to keep per-item information for all items

Outline
- Related Work
- Data Streaming Models
- Sketch based Summarization
- Computing a Sketch
- Networking Applications
- Issues
- Final thoughts

Related Work
- Streaming models are very common [Alon, Strauss, Indyk]
- The need for processing streaming data has been emphasized earlier in the context of data mining [Domingos, Raghavan]
- Recent work is in progress at Stanford (

Related Work
- Sketching technique [Gilbert et al.]
- QuickSand: summarizing network data [Gilbert et al.]
- Sampling methods [Gibbons et al.]
  - Iceberg queries
- Wavelet transforms [Gilbert, Matias]

Data Streaming Model
- Items arrive sequentially in a stream
- The stream is usually described as a signal
- Definition: a signal is a function f : [0…(N-1)] → Z+, where N is the size of the domain and Z+ is the set of all non-negative integers

Data Streaming Models
- Types:
  - Cash Register: Ordered, Unordered
  - Aggregate: Ordered, Unordered

Cash Register Model
- Ordered cash register model: items arrive in increasing/decreasing order of the domain values, e.g. [example items not preserved]
- Unordered cash register model: [example items not preserved]

Aggregate Model
- Ordered aggregate model: e.g. [example items not preserved]
- Unordered aggregate model: e.g. [example items not preserved]

Stream Processing
- The data stream needs to be summarized online as each data item arrives
- We are allowed a certain amount of processing memory
- The summary should be amenable to answering queries

Sketch based Summarization
- A sketch of a signal is a small-space representation of the original signal
- It relies on keeping an estimate of the underlying signal such that a linear projection of the signal can be estimated
- The accuracy of a query estimate depends on the contribution of the queried item relative to the rest of the items

What can we do with a sketch?
- Point queries: “In how many minutes of outgoing calls was a particular phone number involved?”
- Range queries: “How many total minutes of calls were handled by a particular xxx-yyy combination?”
- A sketch can be used for answering point and range queries

Storage Requirements
- For a signal of domain size N, the space required to store the sketch is log^O(1)(N), i.e. polylogarithmic in N
- The size of the sketch depends on the available memory
- The accuracy of query answers can be boosted by increasing the size of the sketch

Bounds on Sketches
- For a point query, a sketch of size […] provides an answer within a factor […] of the correct answer with probability at least […], provided that the cosine between the original sketch vector and the sketch of the query vector is greater than […]
- Here […] is the distortion parameter, […] is the failure probability, and […] is referred to as the failure threshold

Computing a Sketch
- Assume the domain size is N and the size of the sketch is J
- A sketch is a set of J atomic sketches, where an atomic sketch is the inner product of the vector v of size N with a random vector of the same size
- Each random vector consists of binary random variables with values in {1, -1}
- This needs (J x N) random variables

Updating the Sketch
- Let A[i] represent the value of the data item corresponding to item id i
- Let S[j] represent the j-th atomic sketch
- Let R[j][i] represent the binary random variable corresponding to the j-th atomic sketch and the i-th id

Updating the Sketch
- STEP 1: For all j, S[j] = 0
- STEP 2: When A[i] arrives in the data stream, for all j, S[j] = S[j] + A[i]*R[j][i]
- STEP 3: Repeat STEP 2 until the data stream ends or the accounting interval expires
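
The three update steps can be illustrated with a minimal Python sketch. It assumes the random signs R are stored explicitly and drawn from a seeded generator; the function names, the seed, and the toy stream are hypothetical and not part of the slides.

```python
import random


def make_random_signs(J, N, seed=0):
    """Assumption: R[j][i] is stored explicitly as a J x N table of +1/-1 values."""
    rng = random.Random(seed)
    return [[rng.choice((1, -1)) for _ in range(N)] for _ in range(J)]


def update_sketch(S, R, item_id, value):
    """STEP 2: add value * R[j][item_id] to every atomic sketch S[j]."""
    for j in range(len(S)):
        S[j] += value * R[j][item_id]


N, J = 8, 4                           # toy domain size and sketch size
R = make_random_signs(J, N, seed=1)
S = [0] * J                           # STEP 1: all atomic sketches start at 0
stream = [(3, 10), (5, 2), (3, 7)]    # hypothetical (item id, value) arrivals
for item_id, value in stream:         # STEP 3: repeat until the stream ends
    update_sketch(S, R, item_id, value)
```

Because each update is linear, the two arrivals for id 3 have the same effect on S as a single arrival with value 17, so the order of arrival does not matter.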

Storing Random Variables
- For large values of N, J is large
- Explicitly storing (J x N) random variables requires a large amount of memory
- This defeats the purpose of a small-space representation of data streams
- Efficient and fast generation of random variables is desired
- Generating k-wise independent random variables is suggested
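
One common way to avoid storing the (J x N) table is to recompute each sign on demand from a short seed. The sketch below is an illustration under stated assumptions, not the construction used in the slides: it evaluates a random polynomial of degree k-1 over a prime field, which gives k-wise independent hash values, and maps the low bit to +1/-1 (a mapping that is only approximately unbiased). The prime, the function name, and the choice k = 4 are all assumptions.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; item ids are assumed to be smaller than P


def make_sign_function(seed, k=4):
    """Return sign(i) in {+1, -1}, computed from k stored coefficients."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(k)]

    def sign(i):
        h = 0
        for c in coeffs:          # Horner evaluation of the polynomial at i, mod P
            h = (h * i + c) % P
        return 1 if h & 1 else -1

    return sign


# One sign function per atomic sketch: J * k stored integers instead of J * N signs.
J = 4
sign_functions = [make_sign_function(seed=j) for j in range(J)]
row_0_signs = [sign_functions[0](i) for i in range(10)]
```

With this approach R[j][i] is replaced by a call to sign_functions[j](i), so the update and the query recompute the same sign whenever they need it.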

Query Processing
- Processing a point query: to answer a query about the aggregate value of data item i,
  - Form a vector Q of size N such that Q has 1 at id i and 0 elsewhere
  - Compute the sketch of vector Q using the same methodology
  - Let the sketch of Q be represented by vector E

Query Processing
- Claim: the dot product of vector E with vector S gives an approximation of the aggregated value of the queried id
- If the actual value is A[i], the value returned by the query is approximately A[i]
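
A minimal Python sketch of the point query follows. It assumes explicitly stored +1/-1 signs and divides the dot product by J; that normalisation is an assumption, since the slides do not say how the random vectors are scaled. All names and the toy signal are hypothetical.

```python
import random


def build_signs(J, N, seed):
    """Assumption: a J x N table of +1/-1 signs shared by the data and the query."""
    rng = random.Random(seed)
    return [[rng.choice((1, -1)) for _ in range(N)] for _ in range(J)]


def sketch(vector, R):
    """Each atomic sketch is the inner product of the vector with one sign row."""
    return [sum(v * r for v, r in zip(vector, row)) for row in R]


def point_query(S, R, i):
    """Estimate A[i] as the normalised dot product of E (the sketch of Q) with S."""
    E = [row[i] for row in R]     # sketch of the unit vector Q: E[j] = R[j][i]
    return sum(e * s for e, s in zip(E, S)) / len(S)


N, J = 50, 200
R = build_signs(J, N, seed=42)
A = [1] * N
A[7] = 1000                       # one dominant item among many small ones
S = sketch(A, R)
estimate = point_query(S, R, 7)   # close to 1000; the small items add noise
```

The estimate equals A[i] plus an average of cross terms from the other items, which is why a dominant item is recovered accurately while a small item can be swamped by the rest of the signal.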

Outline
- Related Work
- Data Streaming Models
- Sketch based Summarization
- Computing a Sketch
- Networking Applications
- Issues
- Final thoughts

Applications
- Network monitoring and traffic engineering
- Telecom call records
- Financial applications
- Sensor networks
- Web logs and click streams

Network Applications
- Summarizing network data at vantage points
  - Estimating POP-level traffic demands
  - Performing similarity tests
- Transferring summaries
  - The small size of the sketch causes little overhead on the network
- Identifying heavy hitters [QuickSand]

Problem Definition
- A sequence of packets arrives over a fixed window of time
- The destination IP address is the ID for a packet
- N is the total number of IDs, M is the bounded memory (M << N)
- Obtain one-pass summaries of the data stream for the fixed time interval
- Summaries should support point and range queries

Methodology
- Trace-driven simulations
- N is the total number of destinations; the number of atomic sketches is O(log^2 N)
- All atomic sketches are updated on the arrival of each packet
- Queries are computed for the top-20 destinations
- Sample trace statistics:
  - Number of packets = 3.5 M
  - Number of destinations = 45,000
  - Sketch sizes = 250 and 500

Results

Some comparisons…
- Maintaining individual counters for N = 45K: total memory required = 360 KB
- Using a sketch of size 500 for N = 45K: total memory required = 4 KB
- (assuming size of a double = 8 bytes)
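
The two memory figures follow from simple arithmetic, reproduced below as a short check. It assumes one 8-byte double per counter or atomic sketch and 1 KB = 1000 bytes, which is the convention that matches the numbers on the slide; the function name is hypothetical.

```python
BYTES_PER_DOUBLE = 8


def memory_kb(num_values):
    """Memory for num_values doubles, in KB (assuming 1 KB = 1000 bytes)."""
    return num_values * BYTES_PER_DOUBLE / 1000


per_item_kb = memory_kb(45_000)   # 360.0 KB for one counter per destination
sketch_kb = memory_kb(500)        # 4.0 KB for a sketch of 500 atomic sketches
ratio = per_item_kb / sketch_kb   # 90.0, i.e. the sketch is 90 times smaller
```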

Insignificant Byte Counts
- Sketches respond well to queries which seek dominant features in the incoming data stream
- Idea: sketch partitioning [Dobra et al., SIGMOD 2002]
  - Partition the incoming stream into classes
  - Maintain a separate sketch for each class
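
The partitioning idea can be illustrated with a small Python sketch that keeps one sketch per class. This is only a toy illustration and not the scheme of Dobra et al.: the class assignment (item id modulo the number of classes), the class name, and the normalisation by J are all assumptions.

```python
import random


class PartitionedSketch:
    """One sketch per class, so small items are not swamped by other classes."""

    def __init__(self, num_ids, num_classes, J, seed=0):
        rng = random.Random(seed)
        self.num_classes = num_classes
        self.J = J
        # Assumption: explicitly stored +1/-1 signs shared by all classes.
        self.R = [[rng.choice((1, -1)) for _ in range(num_ids)] for _ in range(J)]
        self.S = [[0] * J for _ in range(num_classes)]

    def class_of(self, item_id):
        return item_id % self.num_classes   # hypothetical partitioning rule

    def update(self, item_id, value):
        s = self.S[self.class_of(item_id)]
        for j in range(self.J):
            s[j] += value * self.R[j][item_id]

    def point_query(self, item_id):
        s = self.S[self.class_of(item_id)]
        return sum(self.R[j][item_id] * s[j] for j in range(self.J)) / self.J


ps = PartitionedSketch(num_ids=10, num_classes=2, J=16, seed=3)
ps.update(0, 10_000)   # a heavy item, assigned to class 0
ps.update(1, 3)        # a light item, assigned to class 1
light_estimate = ps.point_query(1)
```

In this toy run the light item sits alone in its class, so the heavy item contributes no noise to its estimate. A useful partitioning rule would need to separate heavy and light items without knowing them in advance, which the slides leave open.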

Deployment Issues
- Implementing such a scheme in routers will steal CPU cycles from the routers
- Idea:
  - Adopt a packet sniffer architecture [IP traceback]
  - Packet headers are read by an attached auxiliary unit as each packet departs the router
  - This produces negligible overhead on the router

Measuring at High Speeds
- In high-speed networks, it is infeasible to process every packet from a data stream
- Atomic sketches may overflow very quickly at high speeds
- Idea:
  - Use sampling (systematic or on a per-byte basis)
  - Page information from SRAM to SDRAM
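
Systematic sampling can be sketched as follows: process every k-th packet and scale its byte count by k so that totals remain comparable. The scaling step, the function name, and the toy packet stream are assumptions added for illustration; the slides only name the sampling strategies.

```python
def systematic_sample(packets, k):
    """Yield every k-th (destination, size) pair with its size scaled up by k."""
    for index, (destination, size) in enumerate(packets):
        if index % k == 0:
            yield destination, size * k


# Hypothetical stream: ten 100-byte packets to the same destination id.
packets = [(1, 100)] * 10
sampled = list(systematic_sample(packets, k=5))
estimated_bytes = sum(size for _, size in sampled)
```

Only the sampled packets would be fed to the sketch update, which cuts the per-packet work by a factor of k at the cost of extra estimation error for small flows.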

Summary
- Capturing data streams using small space is fundamental
- Processing speed and space requirements pose significant hurdles
- Sketches summarize the signal represented by the data stream in small space
- Aggregate queries on the signal can be answered with reasonable accuracy

Summary
- Sketch based summarization of network data provides possible solutions to the traffic accounting problem
- The method can eliminate the need to deploy external infrastructure to monitor network traffic
- Strong probabilistic guarantees can be provided with appropriate settings of the sketch parameters
- Such appropriate settings still need to be quantified

Thank You