A survey on stream data mining


Roadmap
- The basic model of stream data mining
- Counting bit problem
  - Basic idea
  - Exponentially increasing region
  - DGIM method
- Counting distinct elements
  - Flajolet-Martin approach
- Calculating how "uneven" the elements in the stream are
  - The idea of "moment" and the AMS method

Basic model of stream data
- Data arrives rapidly
- The system cannot store the entire data
- Queries tend to ask for information about recent data
- The scan never "turns back": each element is seen only once

Basic model of stream data
[Figure: several input streams (for example …,a,a,b,a,d,c,c,b,c; …,1,0,0,1,1,1,0,1,0; …,3,0,1,1,2,3,1,0,2) enter a processor with limited storage. Queries (commands) are posed to the processor, which produces the output.]

Applications
- Have there been any telephone calls from a certain department of the company to another department in the past 5 minutes?
- Which channels have been the most popular in the past 30 minutes?
- The answers to these kinds of queries vary over time

Sliding windows
- A mechanism that stores the most recent N elements of the stream
- N: window size
- N may be too large for the system to store the whole window
[Figure: a stream of 0/1 elements with arrival times 20 to 46; a window of size N covers the most recent elements, which carry timestamps 7 6 5 4 3 2 1.]

Counting bit problem
- How many 1s are there in the most recent k bits? (given that the stream contains only 0s and 1s)
- Basic idea: store the latest N bits (works when N >= k)
- Advantage: accurate answer
- Drawbacks: storage space (when N is too small or k is too large…); response time?
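The basic idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ExactBitWindow` and its methods are made up for this example and do not come from the slides. It keeps the latest N bits and answers exactly, which also makes the drawbacks visible: memory grows with N, and each query scans k bits.

```python
from collections import deque


class ExactBitWindow:
    """Hypothetical helper: stores the latest n bits and counts 1s exactly."""

    def __init__(self, n):
        self.n = n
        # A deque with maxlen drops the oldest bit automatically.
        self.bits = deque(maxlen=n)

    def add(self, bit):
        self.bits.append(bit)

    def count_ones(self, k):
        """Exact number of 1s among the most recent k bits (requires k <= n)."""
        if k > self.n:
            raise ValueError("k must not exceed the window size N")
        if k <= 0:
            return 0
        return sum(list(self.bits)[-k:])
```

For example, after feeding the bits 1,0,1,1,0,1,1 into a window of size 5, the window holds 1,1,0,1,1, so `count_ones(3)` is 2 and `count_ones(5)` is 4.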

Fix-up 1: exponentially increasing region
[Figure: the bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with a window of size N, summarized by buckets covering regions of exponentially increasing size (1, 2, 4, 8, 16, 32). The bucket counts shown are ? 7 9 5 5 1 3 1.]

Bucket update
Bucket sizes in successive states, one state per line:
32 32 16 8 4 4 2 1 1 1
32 32 16 8 4 4 2 2 1
32 32 16 8 4 4 2 2 1 1
32 32 16 8 4 4 2 2 1 1 1
32 32 16 8 4 4 2 2 2 1
32 32 16 8 4 4 4 2 1
32 32 16 8 8 4 2 1
32 32 16 8 4 4 2 1 1
Sources: http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=246 and http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=252

Fix-up 2: DGIM* method
- Representing buckets
- The error of the last part is smaller
- Update method is similar
[Figure: the bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with a window of size N, covered by buckets labelled: at least 1 of size 16 (partially beyond the window), of size 8, 2 buckets of size 4, 1 bucket of size 2, of size 1.]
*: Datar, Gionis, Indyk, and Motwani
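A minimal Python sketch of a DGIM-style counter follows. It is an illustration under stated assumptions, not the authors' implementation: the class name `DGIM`, the `(timestamp, size)` bucket representation, and the rule of counting half of the oldest overlapping bucket are the commonly taught textbook version. Each bucket stores the time of its most recent 1 and a power-of-two number of 1s; whenever three buckets share a size, the two oldest are merged.

```python
class DGIM:
    """Approximate count of 1s in the last N bits (textbook-style sketch)."""

    def __init__(self, window_size):
        self.n = window_size
        self.t = 0
        # Newest bucket first; each bucket is (time of its most recent 1, size).
        self.buckets = []

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Three buckets of equal size: merge the two oldest, then cascade.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            newer_ts, size = self.buckets[i + 1]
            del self.buckets[i + 2]
            self.buckets[i + 1] = (newer_ts, 2 * size)
            i += 1

    def count(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        k = self.n if k is None else k
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts <= self.t - k:
                break
            total += size
            oldest = size
        # Only part of the oldest bucket may lie inside the range,
        # so count roughly half of it.
        return total - oldest // 2
```

Only the oldest bucket that overlaps the queried range is uncertain, which is why the error stays small: with this merging rule the estimate is within 50% of the true count, while the number of buckets grows only logarithmically with N.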

Counting distinct elements
- How many different web pages did a customer request last week?
- How many different channels did a customer watch yesterday?
- What if we don't have enough space to store the complete set?

Flajolet-Martin approach (1/4)
- A probabilistic counting algorithm
- Originally used to estimate the number of distinct elements in a large file
- Uses little memory
- Single pass only
- Based on statistical observations made on the bits of hashed values

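Flajolet-Martin approach (2/4)
- Hash function h: maps n elements uniformly to log_2 n bits
- bit(y, k) = the kth bit in the binary representation of y
- ρ(y) = the position of the least significant 1 bit of y:
$$\rho(y) = \begin{cases} \min\{k \ge 0 : \mathrm{bit}(y, k) \ne 0\} & \text{if } y > 0 \\ L & \text{if } y = 0 \end{cases}$$
  where L is the number of bits in a hashed value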

Flajolet-Martin approach (3/4)
    for (i := 0 to L-1) do BITMAP[i] := 0;
    for (all x in M) do
    begin
        index := ρ(h(x));
        if BITMAP[index] = 0 then BITMAP[index] := 1;
    end
    R := the largest index in BITMAP whose value equals 1
    Estimate := 2^R

Flajolet-Martin approach (4/4)
- If the final BITMAP looks like this: 0000,0000,1100,1111,1111,1111
- The leftmost 1 appears at position 15 (positions are counted from 0 at the right)
- We say there are around 2^15 distinct elements in the stream
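The pseudocode from slide 3/4 can be written as a short Python sketch. Several details here are assumptions made for illustration and are not in the slides: the hash h is built from SHA-256 truncated to L = 32 bits, a hashed value of 0 (where ρ = L falls outside the bitmap) is simply skipped, and an empty stream returns 0.

```python
import hashlib

L = 32  # assumed number of bits in a hashed value


def h(x):
    """Illustrative hash: first 32 bits of SHA-256 of the element's text form."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(y):
    """Position of the least significant 1 bit of y, or L if y == 0."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1


def fm_estimate(stream):
    """Single-pass estimate 2^R, with R the largest set index in BITMAP."""
    bitmap = [0] * L
    for x in stream:
        index = rho(h(x))
        if index < L:  # assumption: ignore the rare case h(x) == 0
            bitmap[index] = 1
    if 1 not in bitmap:
        return 0
    r = max(i for i, b in enumerate(bitmap) if b == 1)
    return 2 ** r
```

Because the bitmap depends only on which hashed values occur, repeating an element never changes the estimate. The result is always a power of two, so it is a rough order-of-magnitude figure rather than a precise count.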

Moment
- Let m_i be the number of times value i occurs in a stream
- The kth moment is the sum of (m_i)^k over all i
- 0th moment: the number of distinct elements (the problem we just considered)
- 1st moment: the length of the stream
- 2nd moment: measures how uneven the distribution is (the "surprise number")
  - counts 5,5,5,5,5 → surprise number = 125
  - counts 9,9,5,1,1 → surprise number = 189
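As a quick check of the definition, the moments can be computed exactly when the stream is small enough to count every value. The function name `moment` is chosen for this example only. Note that this exact version stores one counter per distinct value, which is precisely what a streaming algorithm tries to avoid.

```python
from collections import Counter


def moment(stream, k):
    """Exact kth moment: sum of (m_i)^k over all values i seen in the stream."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


# Two streams of length 25 over five values, with the counts from the slide.
even_stream = [v for v in "abcde" for _ in range(5)]  # counts 5,5,5,5,5
skewed_stream = (["a"] * 9 + ["b"] * 9 + ["c"] * 5
                 + ["d"] * 1 + ["e"] * 1)             # counts 9,9,5,1,1
```

With these inputs, `moment(even_stream, 2)` is 125 and `moment(skewed_stream, 2)` is 189, while the 0th and 1st moments are 5 and 25 for both streams.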

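AMS* method
- Works for all moments
- Example (stream of length n, 2nd moment): choose a time uniformly at random and let a be the stream element at that time. Define
$$X = n \cdot (2c - 1)$$
  where c is the number of a's in the stream from the chosen time on.
- Expected value, where c_t is the number of times the stream element at time t appears from that time on and m_a is the total number of a's:
$$E(X) = \frac{1}{n} \sum_{t} n\,(2c_t - 1) = \sum_{a} \frac{1}{n} \cdot n \cdot \bigl(1 + 3 + 5 + \dots + (2m_a - 1)\bigr) = \sum_{a} (m_a)^2$$
  which is the 2nd moment.
- Compute as many variables X as can fit in available memory
*: Alon, Matias, and Szegedy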
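The variable X can be sketched in Python as follows. The function names are invented for this example, the stream is held in a list purely for convenience, and averaging the variables is the usual way to combine them rather than something stated on the slide.

```python
import random


def ams_variable(stream, start):
    """X = n * (2c - 1) for one chosen start time (0-based), in a single pass."""
    n = 0
    element, count = None, 0
    for t, x in enumerate(stream):
        n += 1
        if t == start:
            element, count = x, 1  # begin counting the chosen element
        elif t > start and x == element:
            count += 1
    return n * (2 * count - 1)


def ams_estimate(stream, num_vars, seed=0):
    """Average of num_vars variables X with uniformly random start times."""
    stream = list(stream)
    rng = random.Random(seed)
    xs = [ams_variable(stream, rng.randrange(len(stream)))
          for _ in range(num_vars)]
    return sum(xs) / len(xs)
```

For the stream a,a,b,a,d,c,c,b,c (n = 9, counts a: 3, b: 2, c: 3, d: 1) the 2nd moment is 9 + 4 + 9 + 1 = 23. Averaging `ams_variable` over all nine possible start times gives exactly 23, which is the expectation identity from the slide. A random sample of start times gives an estimate that varies around this value.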

Conclusion
Under the stream data model…
- Basic counting (0s and 1s only)
- Fix-ups to basic counting
  - Exponentially increasing region
  - DGIM method
- Distinct element counting
- How "uneven" the distribution is

Discussion
There seems to be no algorithm yet for counting arbitrary tokens under the stream data mining model…

References
- Data mining course at Stanford: http://www.stanford.edu/class/cs345a/
- Stanford InfoLab homepage: http://www-db.stanford.edu/
- Maintaining stream statistics over sliding windows, SIAM Journal on Computing, 2002
- Maintaining variance and k-medians over data stream windows, ACM PODS, 2003
- Probabilistic counting algorithms for data base applications, Journal of Computer and System Sciences, 1985
- The space complexity of approximating the frequency moments, ACM Symposium on Theory of Computing, 1996

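Examples of bit(y, k) & ρ(y)
Example bits: bit(y,0)=0, bit(y,1)=1, bit(y,2)=0, bit(y,3)=1
Values of ρ(y), the position of the least significant 1 bit, for L = 4:

    int y   binary format   ρ(y)
    0       0000            4 (=L)
    1       0001            0
    2       0010            1
    3       0011            0
    4       0100            2
    5       0101            0
    6       0110            1
    7       0111            0
    8       1000            3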