Mining Data Streams AMCS/CS 340 : Data Mining Xiangliang Zhang

Mining Data Streams AMCS/CS 340 : Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

Outline
• Introduction of Data Streams
• Synopsis/sketch maintenance
• Sampling
• Sliding window
• Counting Distinct Elements
• Frequent pattern mining
• Stream Clustering
• Stream Classification
• Change and novelty detection

Motivations
• A large number of applications generate data streams
  – Telecommunication (call records)
  – System management (network events)
  – Surveillance (sensor network, audio/video)
  – Financial market (stock exchange)
  – Day-to-day business (credit card, ATM transactions, etc.)
• Tasks: real-time query answering, statistics maintenance, and pattern discovery on data streams

Characteristics of Data Streams
• High volume (possibly infinite) of continuous data
• Data arrive at a rapid rate
• Data distribution changes on the fly
• The system cannot store the entire stream (only a summary of the data seen thus far)
• Calculations about the stream should be done in a limited amount of (secondary) memory
Data streams: continuous, ordered, changing, fast, huge amount

Example: Network Management Application
IP session data (collected using Cisco NetFlow)
AT&T collects 100 GBs of NetFlow data each day!

Source     Destination  Duration  Bytes  Protocol
10.1.0.2   16.2.3.7     12        20K    http
18.6.7.1   12.4.0.3     16        24K    http
13.9.4.3   11.6.8.2     15        20K    http
15.2.2.9   17.1.2.1     19        40K    http
12.4.3.8   14.8.7.4     26        58K    http
10.5.1.3   13.0.0.1     27        100K   ftp
11.1.0.6   10.3.4.5     32        300K   ftp
19.7.1.2   16.5.5.8     18        80K    ftp

[Figure: Network, Measurements, Network Operations Center, Alarms]

Example: Network Management Application
Data stream processing in network management
• Monitor link bandwidth usage, estimate traffic demands
  – How many bytes were sent between a pair of IP addresses?
  – What fraction of network IP addresses are active?
  – List the top 100 IP addresses in terms of traffic
• Quickly detect faults and congestion, and isolate the root cause
  – List all sessions that transmitted more than 1000 bytes
  – Identify all sessions whose duration was more than twice the normal
  – List all IP addresses that have witnessed a sudden spike in traffic
• Load balancing, improve utilization of network resources

More Applications
• Mining query streams
  – Google wants to know what queries are more frequent today than yesterday.
• Mining click streams
  – Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.
• Mining social network news feeds
  – Look for trending topics on Twitter, Facebook
• Mining call records
  – Summarize telephone call records into customer bills.

Stream processing requirements
• Single pass: each record is examined at most once
• Bounded storage: limited memory (M) for storing the synopsis
• Real-time: per-record processing time (to maintain the synopsis) must be low
[Figure: streams entering over time (. . . 1, 5, 2, 7, 0, 9, 3 / . . . a, r, v, t, y, h, b / . . . 0, 0, 1, 0, 1, 1, 0) feed a processor with limited storage; output: queries / statistics / classification / clustering]

Sampling from a data stream
• We cannot store the entire stream, so we store a sample
• Two different problems:
  – Sample a fixed proportion of elements in the stream (say 1 in 10)
  – Maintain a random sample of fixed size over a potentially infinite stream
• Example:
  a b c d d e e f f g g ---- 1/11, 2/11
  a c d e f g --- 1/6, 1/6
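The second problem (a fixed-size random sample over an unbounded stream) is commonly handled with reservoir sampling. The slide does not name a technique, so the following is a minimal sketch under that assumption; the function name `reservoir_sample` and its parameters are illustrative only.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item is examined once and only k items are stored, which matches the single-pass and bounded-storage requirements stated earlier.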

Sliding Windows
• A useful model of stream processing is that queries are about a window of length N: the N most recent elements received.
  q w e r t y u i o p a s d f g h j k l z x c v b n m
  Past ................................ Future
• Maintaining statistics
  – Count/Sum of non-zero elements
  – Variance
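As a rough illustration of the window model, the sketch below maintains the count of non-zero elements, the sum, and the variance over the N most recent elements. It stores the whole window (O(N) memory), so it is an exact baseline rather than one of the approximate small-space synopses; the class name `SlidingWindowStats` is an assumption made for this example.

```python
from collections import deque


class SlidingWindowStats:
    """Exact statistics over the N most recent stream elements."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0      # running sum of the window
        self.total_sq = 0.0   # running sum of squares of the window
        self.nonzero = 0      # number of non-zero elements in the window

    def add(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        self.nonzero += int(x != 0)
        if len(self.window) > self.size:
            # Expire the oldest element once the window is full.
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old
            self.nonzero -= int(old != 0)

    def variance(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        mean = self.total / n
        # Population variance; clamp tiny negative values from rounding.
        return max(self.total_sq / n - mean * mean, 0.0)
```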

Counting Distinct Elements
• Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far.
• Obvious approach: maintain the set of elements seen.
• Applications:
  – How many different Web pages does each customer request in a week?
  – How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam made to influence search engine rankings?).

Using Small Storage
• Real problem: what if we do not have space to store the complete set?
• Estimate the count in an unbiased way.
• Accept that the count may be in error, but limit the probability that the error is large.
• Flajolet-Martin Approach [FM85]
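The core idea of the Flajolet-Martin approach can be sketched as follows: hash every element, track the largest number of trailing zero bits R seen in any hash value, and report 2^R as the estimate. This is a single-hash sketch only; practical variants combine many hash functions to reduce the variance. The use of MD5 truncated to 32 bits is an assumption made here for determinism, not something the slides prescribe.

```python
import hashlib


def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x (width if x is zero)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def fm_estimate(stream):
    """Single-hash Flajolet-Martin style estimate of the distinct count."""
    max_r = 0
    for item in stream:
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r
```

Because the state is a single integer, repeated elements never change the estimate: duplicates hash to the same value and cannot raise R.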

Frequent Pattern Mining
• Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
• Frequent pattern mining: finding regularities in data
  – What products were often purchased together?
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to this new drug?
  – Can we classify web documents based on key-word combinations?
• Apriori Algorithm

Frequent Patterns in Data Streams: Challenges
• Maintaining exact counts for all (frequent) itemsets needs multiple scans of the stream
  – Maintain an approximation of the counts
• Finding the exact set of frequent itemsets from data streams cannot be done online
  – Have to scan data streams multiple times
  – Space overhead
  – Find an approximation of the set of frequent itemsets

Mining Approximate Frequent Patterns
• Approximate answers are often sufficient (e.g., trend/pattern analysis)
• Example: a router is interested in all flows
  – whose frequency is at least σ (e.g., 10%) of the entire traffic stream seen so far
  – and considers an error of 1/10 of σ (ε = 0.1 σ) acceptable
• How to mine frequent patterns with good approximation?
• Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
  – Major idea: do not trace an item until it becomes frequent
  – Advantage: guaranteed error bound
  – Disadvantage: keeps a large set of traces

Lossy Counting: Ideas
• Divide the stream into buckets, maintain a global count of buckets seen so far
• For any item, if its count is less than the global count of buckets, then its count does not need to be maintained
  – How to divide buckets so that the possible errors are bounded?
  – How to guarantee the number of entries needed to be recorded is also bounded?

Lossy Counting for Frequent Items
Divide the stream into 'buckets' (bucket size is 1/ε = 10)
[Figure: Bucket 1, Bucket 2, Bucket 3]

First Bucket of Stream
[Figure: empty summary + first bucket]
After a bucket, decrease all counters by 1

Next Bucket of Stream
[Figure: current summary + next bucket]
After a bucket, decrease all counters by 1

Approximation Guarantee
• Given: (1) support threshold σ, (2) error threshold ε, and (3) stream length N
• Output: items with frequency counts exceeding (σ – ε)N
• How much do we undercount?
  – If the stream length seen so far is N and the bucket size is 1/ε, then frequency count error ≤ #buckets = εN
• Approximation guarantee
  – No false negatives (frequent but not reported)
  – False positives have true frequency count at least (σ – ε)N
  – Frequency count underestimated by at most εN
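The bucket scheme and its guarantee can be sketched in a few lines. This follows the simplified description on the slides (decrement every counter by 1 at each bucket boundary), not the full (item, count, Δ) bookkeeping of the Manku & Motwani paper; the function name `lossy_count` and its return shape are assumptions made for illustration.

```python
import math


def lossy_count(stream, sigma, epsilon):
    """Simplified lossy counting: returns (reported items with counts, N)."""
    width = math.ceil(1 / epsilon)  # bucket size 1/epsilon
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:
            # End of a bucket: decrease all counters by 1, drop zeros.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    # Each counter lost at most one unit per bucket, i.e. at most epsilon * N,
    # so every item with true count >= sigma * N passes this threshold.
    threshold = (sigma - epsilon) * n
    reported = {k: c for k, c in counts.items() if c >= threshold}
    return reported, n
```

With σ = 0.2 and ε = 0.1 on a stream of 100 items, an item that truly occurs 50 times is stored with a count of at least 40, which is within the εN = 10 undercount bound.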

Clustering Data Streams

CluStream: Clustering On-line Streams
• Online micro-cluster maintenance
  – Initial creation of q micro-clusters
    (q is usually significantly larger than the number of natural clusters)
  – Online incremental update of micro-clusters
    If a new point is within the max-boundary, insert it into the micro-cluster
    Otherwise, create a new cluster
    May delete an obsolete micro-cluster or merge the two closest ones
• Offline query-based macro-clustering
  – Based on a user-specified time horizon h and the number of macro-clusters K, compute macro-clusters using a clustering algorithm, e.g. k-means, DBSCAN.
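The online update step can be sketched as below. This is a simplified illustration, not the CluStream implementation: it keeps only additive summaries (count, linear sum, squared sum), omits timestamps and the deletion of obsolete micro-clusters, and always merges the two closest micro-clusters when the budget q is exceeded. The names `MicroCluster` and `update`, and the parameters `boundary_factor` and `min_radius`, are assumptions. It needs Python 3.8+ for `math.dist`.

```python
import math


class MicroCluster:
    """Additive summary of the points absorbed so far."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)               # linear sum per dimension
        self.ss = [x * x for x in point]    # squared sum per dimension

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS deviation of the absorbed points from the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def insert(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]


def update(clusters, point, q, boundary_factor=2.0, min_radius=1.0):
    """Absorb point into the nearest micro-cluster or open a new one."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point))
        # Assumed max-boundary: a multiple of the radius, with a floor so
        # that single-point clusters can still absorb nearby points.
        boundary = max(boundary_factor * nearest.radius(), min_radius)
        if math.dist(nearest.centroid(), point) <= boundary:
            nearest.insert(point)
            return
    clusters.append(MicroCluster(point))
    if len(clusters) > q:
        # Over budget: merge the two closest micro-clusters.
        pairs = ((math.dist(a.centroid(), b.centroid()), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j)
        _, i, j = min(pairs)
        clusters[i].merge(clusters[j])
        del clusters[j]
```

Because the summaries are additive, inserting a point and merging two micro-clusters are both constant-time per dimension, which is what makes the online phase feasible on a stream.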

Clustering Streams, Model + Reservoir

Summary: Clustering
• Clustering a data stream with one scan and limited main memory
  – Clustering in a sliding window
  – Clustering the whole stream (online)
• How to handle evolving data?
  – Online summarization and offline analysis
  – Change detection
• Applications and extensions
  – Outlier detection, nearest neighbor search, reverse nearest neighbor queries, …

Classification for Dynamic Data Streams
• Decision tree induction for stream data classification
  – VFDT (Very Fast Decision Tree) / CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)
• Is a decision tree good for modeling fast-changing data, e.g., stock market analysis?
• Other stream classification methods
  – Instead of decision trees, consider other models
    Naïve Bayesian
    Ensemble (Wang, Fan, Yu, Han. KDD'03)
    K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD'04)
  – Incremental updating, dynamic maintenance, and model construction

What are the Challenges?
• Data volume
  – Impossible to mine the entire data at one time
  – Can only afford constant memory per data sample
• Concept drifts
  – Previously learned models are invalid
• Cost of learning
  – Model updates can be costly
  – Can only afford constant time per data sample

The Decision Tree Classifier
• Learning (training):
  – Input: a data set of (X, y), where X is a vector and y a class label
  – Output: a model (decision tree)
• Testing:
  – Input: a test sample (x, ?)
  – Output: a class label prediction for x

The Decision Tree Classifier
• A divide-and-conquer approach
  – Simple algorithm, intuitive model
• Compute information gain for data in each node
  – Super-linear complexity
• Typically a decision tree grows one level for each scan of data
  – Multiple scans are required
• The data structure is not 'stable'
  – Subtle changes of data can cause global changes in the data structure

Challenge for streams
• Task:
  – Given enough samples, can we build a tree in constant time that is nearly identical to the tree a batch learner (C4.5, Sprint, etc.) would build?
• Intuition:
  – With an increasing number of samples, the number of possible decision trees becomes smaller
• Forget about concept drifts for now.

Decision-Tree Induction with Data Streams
• At each node, we shall accumulate enough samples (n) before we make a split
• Problem: how many examples are necessary? n = ?
[Figure: a decision tree grown from the data stream, with yes/no splits on Packets > 10, Protocol = http, Bytes > 60K, and Protocol = ftp]
Ack.: from Gehrke's SIGMOD tutorial slides

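Hoeffding Bound
• Given
  – r: real-valued random variable
  – n: number of independent observations of r
  – R: range of r
• The difference between the true mean μ_r and the observed average r_avg is bounded by ε with probability 1 – δ:
  $P(|\mu_r - r_{avg}| \le \epsilon) \ge 1 - \delta$, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$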

Hoeffding Bound
  $P(|\mu_r - r_{avg}| \le \epsilon) \ge 1 - \delta$, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
• Properties:
  – The Hoeffding bound is independent of the data distribution
  – The error ε decreases when n (number of samples) increases
• Hoeffding Tree, based on the Hoeffding bound principle
  – At each node, we shall accumulate enough samples (n) before we make a split
  – When n is large enough, the error ε decreases to a small value
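To make the dependence on n concrete, the sketch below evaluates the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) and inverts it to obtain the number of samples needed for a target ε. The function names are illustrative assumptions.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Error bound epsilon after n observations, holding with prob. 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))
```

For example, with R = 1 and δ = 1e-7, a target of ε = 0.05 requires a few thousand samples; quadrupling n halves ε, regardless of the data distribution.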

Hoeffding Tree Algorithm
Hoeffding Tree Input
  S: sequence of examples
  X: attributes
  G( ): evaluation function, e.g., Gini gain
Hoeffding Tree Algorithm
  for each example in S
    retrieve G(Xa) and G(Xb)    // two highest G(Xi)
    if ( G(Xa) – G(Xb) > ε )    // ε is computed by using the Hoeffding bound
      split on Xa
      recursively go to next node
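The split test at a single node can be sketched as follows. This covers only the decision "is the best attribute reliably better than the runner-up?"; it assumes the per-attribute gains G(Xi) have already been computed from the statistics accumulated at the node, and it leaves out tree growth and tie breaking. The names `choose_split` and `hoeffding_epsilon` are assumptions for this example.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(gains, value_range, delta, n):
    """Return the attribute to split on, or None if more samples are needed.

    gains maps attribute name -> G(Xi) computed from the n examples seen
    at this node.
    """
    if len(gains) < 2:
        return None
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best_gain), (_, second_gain) = ranked[0], ranked[1]
    # Split only when the observed gap exceeds the Hoeffding error bound.
    if best_gain - second_gain > hoeffding_epsilon(value_range, delta, n):
        return best_attr
    return None
```

With a gap of 0.4 between the two best attributes and δ = 1e-7, the node keeps waiting at n = 10 but splits by n = 1000, because ε shrinks as more examples arrive.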

Hoeffding Tree: Pros and Cons
• Pros: scales better than traditional DT algorithms
  – Incremental
  – Sub-linear with sampling
  – Small memory requirement
• Cons:
  – Only considers the top 2 attributes
  – Tie breaking takes time
  – Growing a deep tree takes time

Change detection
• General idea: compare a reference distribution with a current window of events
• Kullback-Leibler distance can be used to measure the difference between two given distributions [Dasu et al, 2006]
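A much-simplified version of this idea compares two windows of categorical events by their smoothed histograms. This is not the method of Dasu et al.; it is only a sketch of how the Kullback-Leibler distance between a current window and a reference window might be computed. The function name, the additive smoothing constant, and the direction KL(current || reference) are assumptions.

```python
import math
from collections import Counter


def kl_divergence(reference, current, smoothing=0.5):
    """KL(current || reference) over the observed event types."""
    support = set(reference) | set(current)
    ref_counts, cur_counts = Counter(reference), Counter(current)
    # Additive smoothing keeps every probability strictly positive.
    ref_total = len(reference) + smoothing * len(support)
    cur_total = len(current) + smoothing * len(support)
    kl = 0.0
    for event in support:
        p = (cur_counts[event] + smoothing) / cur_total
        q = (ref_counts[event] + smoothing) / ref_total
        kl += p * math.log(p / q)
    return kl
```

A change would be flagged when the value exceeds a threshold; how to choose that threshold is outside the scope of this sketch.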

Change detection
• General idea: compare a reference distribution with a current window of events
• Density statistic test can be used to test whether the newly observed data points S0 are sampled from the underlying distribution that produced the baseline data set S. [Song et al, 2007]

Issue of reference window in change detection
• General idea: compare a reference distribution with a current window of events
  – Based on stationary reference data
• What if the underlying distribution is not stationary?
  – E.g., in network intrusion detection by monitoring network traffic, the distribution of reference data (usually normal data) is evolving over time

Summary: Stream Data Mining
• Stream data mining: a rich and on-going research field
• Research in the database community:
  – DSMS system architecture, continuous query processing, supporting mechanisms
• Stream data mining
  – Powerful tools for finding general and unusual patterns
  – Effectiveness, efficiency and scalability: lots of open problems

References on Stream Data Mining (1)
• C. Aggarwal, J. Han, J. Wang, P. S. Yu. A Framework for Clustering Data Streams, VLDB'03
• C. C. Aggarwal, J. Han, J. Wang and P. S. Yu. On-Demand Classification of Evolving Data Streams, KDD'04
• C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams, VLDB'04
• S. Babu and J. Widom. Continuous Queries over Data Streams, SIGMOD Record, 2001
• B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems, PODS'02 (conference tutorial)
• Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02
• P. Domingos and G. Hulten. Mining High-Speed Data Streams, KDD'00
• A. Dobra, M. N. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD'02
• J. Gehrke, F. Korn, D. Srivastava. On Computing Correlated Aggregates over Continuous Data Streams, SIGMOD'01
• C. Giannella, J. Han, J. Pei, X. Yan and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities, in Kargupta et al. (eds.), Next Generation Data Mining'02
• M. Datar, A. Gionis, P. Indyk, R. Motwani. Maintaining Stream Statistics over Sliding Windows, SODA 2002

References on Stream Data Mining (2)
• S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams, FOCS'00
• G. Hulten, L. Spencer and P. Domingos. Mining Time-Changing Data Streams, KDD 2001
• S. Madden, M. Shah, J. Hellerstein, V. Raman. Continuously Adaptive Continuous Queries over Streams, SIGMOD'02
• G. Manku, R. Motwani. Approximate Frequency Counts over Data Streams, VLDB'02
• A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT'05
• S. Muthukrishnan. Data Streams: Algorithms and Applications, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003
• R. Motwani and P. Raghavan. Randomized Algorithms, Cambridge Univ. Press, 1995
• S. Viglas and J. Naughton. Rate-Based Query Optimization for Streaming Information Sources, SIGMOD'02
• Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, VLDB'02
• H. Wang, W. Fan, P. S. Yu, and J. Han. Mining Concept-Drifting Data Streams Using Ensemble Classifiers, KDD'03

Acknowledgements
Some of the material is borrowed from lectures of
• Jiawei Han and Micheline Kamber
• Minos Garofalakis, Johannes Gehrke and Rajeev Rastogi
• Jure Leskovec and Anand Rajaraman
• Haixun Wang, Jian Pei, and Philip S. Yu