A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.


1 A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

2 Introduction (1) Data Stream –Continuously arriving data flow –Applications: network traffic, credit card transaction flow, phone calling records, etc.

3 Introduction (2) Stream Classification –Construct a classification model based on past records –Use the model to predict labels for new data –Help decision making [Figure: Labeling, Classification model, Fraud?]

4 Framework [Figure: stream of data chunks; a Classification Model is used to Predict the label of new data]

5 Concept Drifts Changes in P(x,y) –P(x,y)=P(y|x)P(x), where x is the feature vector and y is the class label –Four cases: No Change, Feature Change, Conditional Change, Dual Change –Expected error is not a good indicator of concept drifts –Training on the most recent data could help reduce expected error [Figure: data at Time Stamp 1, Time Stamp 11, and Time Stamp 21]
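
The four cases on this slide can be sketched in code. The following is a minimal illustration, not the authors' procedure: it assumes a discrete feature space where P(x) and P(y=1|x) are known exactly at two time stamps (in a real stream they are unknown), and the function name `drift_type` and the tolerance `tol` are hypothetical.

```python
def drift_type(px_old, px_new, pyx_old, pyx_new, tol=1e-9):
    """Name the change between two time stamps.

    px_*  : dict mapping each feature value x to P(x)
    pyx_* : dict mapping each feature value x to P(y=1 | x)
    Assumption: both time stamps list the same feature values.
    """
    def changed(old, new):
        keys = set(old) | set(new)
        return any(abs(old.get(k, 0.0) - new.get(k, 0.0)) > tol for k in keys)

    feature = changed(px_old, px_new)        # did P(x) move?
    conditional = changed(pyx_old, pyx_new)  # did P(y|x) move?
    if feature and conditional:
        return "dual change"
    if feature:
        return "feature change"
    if conditional:
        return "conditional change"
    return "no change"


px_t1 = {"a": 0.5, "b": 0.5}
px_t2 = {"a": 0.8, "b": 0.2}
pyx_t1 = {"a": 0.1, "b": 0.9}
pyx_t2 = {"a": 0.3, "b": 0.9}
print(drift_type(px_t1, px_t2, pyx_t1, pyx_t2))
```

Because P(x,y) factors into P(y|x)P(x), checking the two factors separately is enough to tell the four cases apart.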

6 Issues in Stream Classification (1) Generative Model –P(y|x) follows some distribution Descriptive Model –Let the data decide Stream Data –Distribution unknown and evolving

7 Issues in Stream Classification (2) Label Prediction –Classify x into one class Probability Estimation –x is assigned to all classes with different probabilities Stream Applications –Stochastic in nature, so prediction confidence information is needed

8 Mining Skewed Data Stream Skewed Distribution –Credit card frauds, network intrusions Existing Stream Classification Algorithms –Evaluated on balanced data Problems –Ignore minority examples –The cost of misclassifying minority examples is usually huge [Figure: decision tree over + and - examples; every leaf node is classified as negative]
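
A small sketch of why ignoring the minority class is tempting but harmful. The 1% positive rate below is an illustrative assumption, not a figure from the slides, and `accuracy_and_recall` is a hypothetical helper.

```python
def accuracy_and_recall(labels, predictions):
    """Return (accuracy, recall on the positive class) for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    positives = sum(1 for y in labels if y == 1)
    accuracy = correct / len(labels)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Assumed skew: 10 positives (e.g. frauds) among 1000 records.
labels = [1] * 10 + [0] * 990
all_negative = [0] * len(labels)  # the "every leaf is negative" model
accuracy, recall = accuracy_and_recall(labels, all_negative)
print(accuracy, recall)
```

The all-negative model looks accurate overall yet finds none of the minority examples, which are the costly ones to miss.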

9 Stream Ensemble Approach (1) Step 1: Sampling [Figure: stream chunks followed by an unlabeled chunk; Training set? Insufficient positive examples!]

10 Stream Ensemble Approach (2) Step 2: Ensemble [Figure: classifiers C1, C2, …, Ck built from training sets 1, 2, …, k]
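
The two steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: positives kept from earlier chunks are pooled with the current chunk's positives, the current negatives are split into k disjoint subsets, one model is trained per subset, and the k probability outputs are averaged (the averaging rule is an assumption here). The one-dimensional centroid learner `train_centroid_model` is a hypothetical stand-in for any base classifier that outputs probabilities.

```python
import random


def build_training_sets(old_positives, current_chunk, k, seed=0):
    """Step 1: pair all positives with k disjoint subsets of current negatives.

    Examples are (x, y) tuples with y in {0, 1}; k must not exceed the
    number of negatives in the current chunk.
    """
    rng = random.Random(seed)
    positives = list(old_positives) + [(x, y) for x, y in current_chunk if y == 1]
    negatives = [(x, y) for x, y in current_chunk if y == 0]
    rng.shuffle(negatives)
    return [positives + negatives[i::k] for i in range(k)]


def train_centroid_model(examples):
    """Hypothetical base learner: score by distance to the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)

    def predict_proba(x):
        d_pos, d_neg = abs(x - pos_mean), abs(x - neg_mean)
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)

    return predict_proba


def train_ensemble(old_positives, current_chunk, k):
    """Step 2: one classifier per training set."""
    sets = build_training_sets(old_positives, current_chunk, k)
    return [train_centroid_model(s) for s in sets]


def ensemble_probability(models, x):
    """Average the members' estimates of P(y=1 | x)."""
    return sum(m(x) for m in models) / len(models)


old_pos = [(9.0, 1), (10.0, 1)]
chunk = [(0.0, 0), (1.0, 0), (2.0, 0), (0.5, 0), (1.5, 0), (2.5, 0), (11.0, 1)]
models = train_ensemble(old_pos, chunk, k=3)
print(ensemble_probability(models, 10.0), ensemble_probability(models, 1.0))
```

Each member sees every positive example but only a slice of the negatives, so each training set is far less skewed than the raw chunk.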

11 Why does this approach work? Incorporation of old positive examples –increase the training size, reduce variance –negative examples reflect current concepts, so the increase in boundary bias is small Ensemble –reduce variance caused by single model –disjoint sets of negative examples, so the classifiers will make uncorrelated errors Bagging & Boosting –running cost is much higher –cannot generate reliable probability estimates for skewed distributions
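
The variance argument can be made concrete with a standard identity for averages. This is a textbook result offered as a sketch, not the analysis from the talk; it assumes k member outputs f_1(x), …, f_k(x) with equal variance σ² and equal pairwise correlation ρ, symbols that do not appear on the slides.

```latex
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} f_i(x)\right)
  = \rho\,\sigma^{2} + \frac{(1-\rho)\,\sigma^{2}}{k},
\qquad\text{so for uncorrelated errors } (\rho = 0):\quad
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} f_i(x)\right) = \frac{\sigma^{2}}{k}.
```

Under these assumptions, the less correlated the members' errors are, the closer the averaged output gets to the full 1/k reduction.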

12 Analysis Error Reduction –Sampling –Ensemble Efficiency Analysis –Single model –Ensemble –Ensemble is more efficient

13 Experiments Measures –Mean Squared Error –ROC Curve –Recall-Precision Curve Baseline Methods –NS: No Sampling + Single Model –SS: Sampling + Single Model –SE: Sampling + Ensemble
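
Mean squared error measures the quality of probability estimates rather than of hard labels. A minimal sketch, assuming the target for each example is its 0/1 label (the talk may instead compare against the true posterior where it is known); the function name is hypothetical.

```python
def mean_squared_error(labels, probabilities):
    """Average squared gap between 0/1 labels and estimated P(y=1 | x)."""
    if not labels or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and aligned")
    return sum((y - p) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


labels = [1, 0, 0, 0]
estimates = [0.5, 0.5, 0.0, 0.0]
mse = mean_squared_error(labels, estimates)
print(mse)  # 0.125
```

Lower values mean the estimated probabilities sit closer to the observed outcomes.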

14 Experimental Results (1) Mean Squared Error on Synthetic Data –Feature Change: only P(x) changes –Conditional Change: only P(y|x) changes –Dual Change: both P(x) and P(y|x) change

15 Experimental Results (2) Mean Squared Error on Real Data

16 Experimental Results (3) Plots on Synthetic Data: ROC Curve, Recall-Precision Plot

17 Experimental Results (4) Plots on Real Data: ROC Curve, Recall-Precision Plot

18 Experimental Results (5) Training Time

19 Conclusions General issues in stream classification –concept drifts –descriptive model –probability estimation Mining skewed data streams –sampling and ensemble techniques –accurate and efficient Wide applications –graph data –airforce data

20 Thanks! Any questions?

