1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.

Slides:



Advertisements
Similar presentations
Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.
Advertisements

Mining Data Streams (Part 1)
1 Frequent Itemset Mining: Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored basket-by-basket.
Ariel Rosenfeld Network Traffic Engineering. Call Record Analysis. Sensor Data Analysis. Medical, Financial Monitoring. Etc,
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
1 A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch Rensselaer Polytechnic Institute Srikanta Tirthapura.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Adaptive Frequency Counting over Bursty Data Streams Bill Lin, Wai-Shing Ho, Ben Kao and Chun-Kit Chui Form CIDM07.
Mining Data Streams.
1 Mining Associations Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
1 Association Rules Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
CSC1016 Coursework Clarification Derek Mortimer March 2010.
Extension of DGIM to More Complex Problems
Hash Table indexing and Secondary Storage Hashing.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
CURE Algorithm Clustering Streams
1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
A survey on stream data mining
1 More Stream-Mining Counting How Many Elements Computing “Moments”
Chain: Operator Scheduling for Memory Minimization in Data Stream Systems Authors: Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani (Dept.
More Stream-Mining Counting Distinct Elements Computing “Moments”
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.
1 Still More Stream-Mining Frequent Itemsets Elephants and Troops Exponentially Decaying Windows.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
The Study of Computer Science Chapter 0 Intro to Computer Science CS1510, Section 2.
The Study of Computer Science Chapter 0 Intro to Computer Science CS1510.
CMPE 421 Parallel Computer Architecture
Maintaining Variance and k-Medians over Data Stream Windows Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O’Callaghan. Presentation by.
Approximating Hit Rate Curves using Streaming Algorithms Nick Harvey Joint work with Zachary Drudi, Stephen Ingram, Jake Wires, Andy Warfield TexPoint.
IT253: Computer Organization
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
CS 232: Computer Architecture II Prof. Laxmikant (Sanjay) Kale.
Mining of Massive Datasets Ch4. Mining Data Streams
Searching Topics Sequential Search Binary Search.
Lecture 1: Basic Operators in Large Data CS 6931 Database Seminar.
Searching CSE 103 Lecture 20 Wednesday, October 16, 2002 prepared by Doug Hogan.
COMP091 – Operating Systems 1 Memory Management. Memory Management Terms Physical address –Actual address as seen by memory unit Logical address –Address.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Mining Data Streams (Part 1)
What about streaming data?
Computer Organization
The Stream Model Sliding Windows Counting 1’s
The Stream Model Sliding Windows Counting 1’s
Web-Mining Agents Stream Mining
Mining Frequent Patterns from Data Streams
COMP9313: Big Data Management Lecturer: Xin Cao Course web site:
CS6234 Advanced Algorithms February
Mining Data Streams The Stream Model Sliding Windows Counting 1’s Exponentially Decaying Windows Usually, we mine data that sits somewhere in a database.
Streaming & sampling.
Mining Data Streams (Part 1)
COMP9313: Big Data Management Lecturer: Xin Cao Course web site:
Counting How Many Elements Computing “Moments”
Mining Data Streams (Part 2)
Mining Data Streams Some of these slides are based on Stanford Mining Massive Data Sets Course slides at
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
Range-Efficient Computation of F0 over Massive Data Streams
Data Intensive and Cloud Computing Data Streams Lecture 10
Introduction to Stream Computing and Reservoir Sampling
Test Drop Rules: If not:
Software Development Techniques
Maintaining Stream Statistics over Sliding Windows
Counting Bits.
Presentation transcript:

1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s

2 The Stream Model uData enters at a rapid rate from one or more input ports. uThe system cannot store the entire stream. uHow do you make critical calculations about the stream using a limited amount of (secondary) memory?

3 Processor Limited Storage... 1, 5, 2, 7, 0, 9, 3... a, r, v, t, y, h, b... 0, 0, 1, 0, 1, 1, 0 time Streams Entering Queries Output

4 Applications --- (1) uIn general, stream processing is important for applications where wNew data arrives frequently. wImportant queries tend to ask about the most recent data, or summaries of data.

5 Applications --- (2) uMining query streams. wGoogle wants to know what queries are more frequent today than yesterday. uMining click streams. wYahoo wants to know which of its pages are getting an unusual number of hits in the past hour.

6 Applications --- (3) uSensors of all kinds need monitoring, especially when there are many sensors of the same type, feeding into a central controller, most of which are not sensing anything important at the moment. uTelephone call records summarized into customer bills.

7 Applications --- (4) uIntelligence-gathering. wLike “evil-doers visit hotels” at beginning of course, but much more data at a much faster rate. Who calls whom? Who accesses which Web pages? Who buys what where?

8 Sliding Windows uA useful model of stream processing is that queries are about a window of length N --- the N most recent elements received. uInteresting case: N is so large it cannot be stored in memory, or even on disk. wOr, there are so many streams that windows for all cannot be stored.

9 q w e r t y u i o p a s d f g h j k l z x c v b n m Past Future

10 Counting Bits --- (1)  Problem: given a stream of 0’s and 1’s, be prepared to answer queries of the form “how many 1’s in the last k bits?” where k ≤ N. uObvious solution: store the most recent N bits. wWhen new bit comes in, discard the N +1 st bit.

11 Counting Bits --- (2) uYou can’t get an exact answer without storing the entire window. uReal Problem: what if we cannot afford to store N bits? wE.g., we are processing 1 billion streams and N = 1 billion, but we’re happy with an approximate answer.

12 Something That Doesn’t (Quite) Work uSummarize exponentially increasing regions of the stream, looking backward. uDrop small regions if they begin at the same point as a larger region.

13 Example N We can construct the count of the last N bits, except we’re Not sure how many of the last 6 are included. ? 6

14 What’s Good? uStores only O(log 2 N ) bits. wO(log N ) counts of log 2 N bits each. uEasy update as more bits enter. uError in count no greater than the number of 1’s in the “unknown” area.

15 What’s Not So Good? uAs long as the 1’s are fairly evenly distributed, the error due to the unknown region is small --- no more than 50%. uBut it could be that all the 1’s are in the unknown area at the end. uIn that case, the error is unbounded.

16 Fixup uInstead of summarizing fixed-length blocks, summarize blocks with specific numbers of 1’s. wLet the block “sizes” (number of 1’s) increase exponentially. uWhen there are few 1’s in the window, block sizes stay small, so errors are small.

17 DGIM* Method uStore O(log 2 N ) bits per stream. uGives approximate answer, never off by more than 50%. wError factor can be reduced to any fraction > 0, with more complicated algorithm and proportionally more stored bits. *Datar, Gionis, Indyk, and Motwani

18 Timestamps uEach bit in the stream has a timestamp, starting 1, 2, … uRecord timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log 2 N ) bits.

19 Buckets uA bucket in the DGIM method is a record consisting of: 1.The timestamp of its end [O(log N ) bits]. 2.The number of 1’s between its beginning and end [O(log log N ) bits]. uConstraint on buckets: number of 1’s must be a power of 2. wThat explains the log log N in (2).

20 Representing a Stream by Buckets uEither one or two buckets with the same power-of-2 number of 1’s. uBuckets do not overlap in timestamps. uBuckets are sorted by size (# of 1’s). wEarlier buckets are not smaller than later buckets. uBuckets disappear when their end-time is > N time units in the past.

21 Example N 1 of size 2 2 of size 4 2 of size 8 At least 1 of size 16. Partially beyond window. 2 of size 1

22 Updating Buckets --- (1) uWhen a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time. uIf the current bit is 0, no other changes are needed.

23 Updating Buckets --- (2) uIf the current bit is 1: 1.Create a new bucket of size 1, for just this bit. uEnd timestamp = current time. 2.If there are now three buckets of size 1, combine the oldest two into a bucket of size 2. 3.If there are now three buckets of size 2, combine the oldest two into a bucket of size 4. 4.And so on…

24 Example

25 Querying uTo estimate the number of 1’s in the most recent N bits: 1.Sum the sizes of all buckets but the last. 2.Add in half the size of the last bucket. uRemember, we don’t know how many 1’s of the last bucket are still within the window.

26 Error Bound uSuppose the last bucket has size 2 k. uThen by assuming 2 k -1 of its 1’s are still within the window, we make an error of at most 2 k -1. uSince there is at least one bucket of each of the sizes less than 2 k, the true sum is no less than 2 k -1. uThus, error at most 50%.

27 Extensions (For Thinking) uCan we use the same trick to answer queries “How many 1’s in the last k ?” where k < N ? uCan we handle the case where the stream is not bits, but integers, and we want the sum of the last k ?