Learning from Data Streams


1 Learning from Data Streams
CS240B notes by Carlo Zaniolo, UCLA CSD, with slides from an ICDE 2005 tutorial by Haixun Wang, Jian Pei & Philip Yu

2 What are the Challenges?
- Data volume: it is impossible to mine the entire data set at once; we can afford only constant memory per data sample.
- Concept drift: previously learned models become invalid.
- Cost of learning: model updates can be costly; we can afford only constant time per data sample.

3 On-Line Learning
- Learning (training): input is a data set of pairs (a, b), where a is a feature vector and b is a class label; output is a model (e.g., a decision tree).
- Testing: input is a test sample (x, ?); output is a class-label prediction for x.
- When mining data streams, the two phases are often combined, since concept shift requires continuous training (see the sketch below).
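A common way to combine the two phases is "test-then-train" (prequential) evaluation: each arriving sample is first used to test the current model, then immediately used to update it. A minimal sketch, assuming a hypothetical incremental classifier with predict and learn_one methods (these names are illustrative, not from the slides):

```python
# Prequential (test-then-train) loop over a data stream.
# `model` is a hypothetical incremental classifier exposing
# predict(x) and learn_one(x, y).
def prequential(stream, model):
    correct = total = 0
    for x, y in stream:              # x: feature vector, y: class label
        if model.predict(x) == y:    # test on the sample first...
            correct += 1
        total += 1
        model.learn_one(x, y)        # ...then train on it, in constant time
    return correct / total           # accuracy accumulated over the stream
```

Because every sample is seen exactly once and then discarded, the loop respects the constant-memory, one-pass constraints above.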

4 Mining Data Streams: Challenges
- On-line response, limited memory, most recent windows only.
- Fast & light algorithms are needed that minimize memory and CPU usage and require only one (or a few) passes through the data.
- Concept shift/drift: changes in the statistics of the mined data render previously learned models inaccurate or invalid.
- Robustness and adaptability: the learner must quickly recover/adjust after concept changes.
- Popular machine learning algorithms are no longer effective: neural nets are slow learners that require many passes; Support Vector Machines (SVMs) are computationally too expensive.

5 Classifier Algorithms: from Databases to Data Streams
- New algorithms have emerged, e.g., Bloom filters and ANNCAD.
- Existing algorithms have been adapted: the Naïve Bayesian Classifier (NBC) survives with only minor changes; decision trees require significant adaptation; classifier ensembles remain effective, after significant changes.
- Popular algorithms are no longer effective: neural nets are slow learners that require many passes; Support Vector Machines (SVMs) are computationally too expensive.

6 Decision Tree Classifiers
- A divide-and-conquer approach: a simple algorithm producing an intuitive model.
- Typically a decision tree grows one level for each scan of the data, so multiple scans are required; but if small samples suffice, this problem disappears.
- However, the data structure is not 'stable': subtle changes in the data can cause global changes in the tree.

7 Challenge #1
How many samples do we need to build, in constant time, a tree that is nearly identical to the tree built by a batch learner (C4.5, SPRINT, ...)?
Nearly identical?
- Categorical attributes: with high probability, the attribute we choose for the split is the same attribute that a batch learner would choose, yielding an identical decision tree.
- Continuous attributes: discretize them into categorical ones.
- ...Forget concept shift/drift for now.

8 Hoeffding Bound
Also known as the additive Chernoff bound. Given:
- r: a real-valued random variable
- n: the number of independent observations of r
- R: the range of r
the true mean of r is at least r_avg − ε with probability 1 − δ, i.e., P(μ_r ≥ r_avg − ε) = 1 − δ, where:
ε = √( R² ln(1/δ) / (2n) )
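The formula translates directly into code; as a worked example, information gain over c classes has range R = log2(c), so R = 1 for a two-class problem:

```python
import math

def hoeffding_epsilon(R, n, delta):
    """With probability 1 - delta, the true mean of r lies within
    epsilon of the average observed over n independent samples."""
    return math.sqrt((R * R) * math.log(1.0 / delta) / (2.0 * n))

# Worked example: two-class information gain (R = 1), delta = 1e-7.
print(hoeffding_epsilon(R=1.0, n=1_000, delta=1e-7))    # ~0.0898
print(hoeffding_epsilon(R=1.0, n=100_000, delta=1e-7))  # ~0.0090
```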

9 Hoeffding Bound
Properties:
- The Hoeffding bound is independent of the data distribution.
- The error ε decreases as n (the number of samples) increases.
- At each node, we accumulate enough samples (n) before we make a split (see the sketch below).
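This is the idea behind Hoeffding trees (Domingos & Hulten's VFDT): a leaf is split only when the observed gap between the best and second-best split attributes exceeds ε, so that with probability 1 − δ the chosen attribute matches the batch learner's choice. A sketch, where the leaf object, its statistics, and the gain function are hypothetical placeholders and hoeffding_epsilon is the function sketched above:

```python
def maybe_split(leaf, delta, R=1.0):
    """Split only when, with probability 1 - delta, the apparently best
    split attribute is truly the best. `leaf.attributes`,
    `leaf.sample_count`, `leaf.split_on`, and `gain` are hypothetical
    placeholders, not an actual API."""
    scored = sorted(((gain(leaf, a), a) for a in leaf.attributes),
                    key=lambda ga: ga[0], reverse=True)
    (g_best, a_best), (g_second, _) = scored[0], scored[1]
    eps = hoeffding_epsilon(R, leaf.sample_count, delta)
    if g_best - g_second > eps:  # gap too large to be sampling error
        leaf.split_on(a_best)
```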

10 Nearly Identical?
- Categorical attributes: with high probability, the attribute we choose for the split is the same attribute that a batch learner would choose; thus we seek an identical decision tree.
- Continuous attributes: discretize them into categorical ones.
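One simple way to discretize a continuous attribute, as the slide suggests, is equal-width binning over an assumed value range; a minimal sketch (the range endpoints and bin count are illustrative assumptions):

```python
def discretize(value, lo, hi, n_bins):
    """Map a continuous value in [lo, hi] to a category in {0, ..., n_bins - 1}."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

# e.g., bucket an age in [0, 100] into 5 equal-width categories:
discretize(37.5, lo=0, hi=100, n_bins=5)   # -> 1
```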

