Presentation on theme: "Xiangnan Kong, Philip S. Yu: An Ensemble-based Approach to Fast Classification of Multi-label Data Streams. Dept. of Computer Science, University of Illinois at Chicago" — Presentation transcript:

1 An Ensemble-based Approach to Fast Classification of Multi-label Data Streams
Xiangnan Kong, Philip S. Yu
Dept. of Computer Science, University of Illinois at Chicago

2 Introduction: Data Stream
- Data Stream: a high-speed data flow, arriving continuously and changing over time
- Applications: online message classification, network traffic monitoring, credit card transaction classification

3 Introduction: Stream Classification
- Construct a classification model on past stream data (train)
- Use the model to predict the class label for incoming data (classify)

4 Multi-Label Stream Data
- Conventional stream classification uses single-label settings: each stream object is assumed to have exactly one label
- In many real applications, one stream object can have multiple labels, e.g. a news article tagged with several labels at once ("Company", "Sad", "Legendary"), or emails carrying multiple labels

5 5 Multi-Label Stream Classification Traditional Stream Classification instance label object instance label object label …… Multi-label Stream Classification

6 The Problem
Stream data:
- Huge data volume + limited memory: the entire dataset cannot be stored for training, so a one-pass algorithm over the stream is required
- High speed: data must be processed promptly
- Concept drift: old data become outdated
Multi-label classification:
- Large (exponential) number of possible label sets
- Conventional multi-label classification approaches focus on offline settings and cannot be applied here

7 Our Solution: Random Tree
- Random trees: very fast in training and testing
- Ensemble of multiple trees: effective, and reduces the prediction variance
- Statistics of multiple labels on the tree nodes: effective training/testing on multiple labels
- Fading function: reduces the influence of old data

8 Multi-label Random Tree vs. Conventional Decision Trees
Multi-label random tree:
- Single pass over the data
- Splits each node on a random variable with a random threshold
- Ensemble of multiple trees
- Multi-label predictions
- Fades out old data
Conventional decision trees:
- Multiple passes over the dataset
- Variable selection at each node split
- Single-label prediction
- Static updates: use the entire dataset, including outdated data
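The random split scheme contrasted above can be sketched as follows; class and parameter names are illustrative, not from the paper, and features are assumed to be numeric in [0, 1). Because the split variable and threshold are drawn at random, the tree structure can be built without any pass over the data, which is what makes training single-pass.

```python
import random

class Node:
    """One node of a multi-label random tree (illustrative sketch):
    each internal node splits on a randomly chosen feature with a
    randomly chosen threshold, so no variable selection is needed."""

    def __init__(self, depth, n_features, max_depth):
        self.left = self.right = None
        if depth < max_depth:
            self.feature = random.randrange(n_features)  # random variable
            self.threshold = random.random()             # random threshold in [0, 1)
            self.left = Node(depth + 1, n_features, max_depth)
            self.right = Node(depth + 1, n_features, max_depth)

    def route(self, x):
        """Return the leaf that instance x (a feature vector) falls into."""
        if self.left is None:
            return self
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.route(x)
```

Training and testing then reduce to routing each instance to a leaf and reading or updating the statistics stored there.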

9 Training: Update Trees
Each incoming instance from the data stream is routed down every tree in the ensemble (Tree 1 through Tree Nt), and the statistics of the nodes it passes through are updated.

10 On the Tree Nodes
Statistics maintained on each node:
1. Aggregated label relevance vector
2. Aggregated number of instances
3. Aggregated label-set cardinalities
4. Time stamp of the latest update
Fading function: the statistics are rescaled with a time fading function to reduce the effect of old data on the node statistics.
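A minimal sketch of these per-node statistics with an exponential fading update; the slide only says the statistics are rescaled with a time fading function, so the concrete form and the rate `LAMBDA` here are assumptions:

```python
import math

LAMBDA = 0.01  # fading rate; the actual value and functional form are assumptions

class NodeStats:
    """Per-node statistics of a multi-label random tree (sketch)."""

    def __init__(self, n_labels):
        self.relevance = [0.0] * n_labels  # 1. aggregated label relevance vector
        self.count = 0.0                   # 2. aggregated number of instances
        self.cardinality = 0.0             # 3. aggregated label-set cardinalities
        self.t_last = 0                    # 4. time stamp of the latest update

    def update(self, labels, t_now):
        # Rescale the old statistics with an exponential fading function
        # so that outdated data lose influence over time.
        decay = math.pow(2.0, -LAMBDA * (t_now - self.t_last))
        self.relevance = [r * decay for r in self.relevance]
        self.count *= decay
        self.cardinality *= decay
        # Fold in the new instance and its label set.
        for label in labels:
            self.relevance[label] += 1.0
        self.count += 1.0
        self.cardinality += len(labels)
        self.t_last = t_now
```

Storing only these aggregates (rather than the instances themselves) is what keeps memory bounded on an unbounded stream.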

11 Prediction
- Route the test instance down every tree (Tree 1 through Tree Nt) and aggregate the predictions
- Use the aggregated label relevance to rank all possible labels
- Use the aggregated label-set cardinality to decide how many labels are included in the label set
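The two prediction rules above can be sketched as follows; plain dicts stand in for the per-tree leaf statistics, and the rounding of the average cardinality to pick k is an assumption:

```python
def predict(leaf_stats, n_labels):
    """Aggregate the leaf statistics collected from every tree, rank
    labels by aggregated relevance, and keep the top-k labels, where
    k is the (rounded) average label-set cardinality."""
    relevance = [0.0] * n_labels
    count = cardinality = 0.0
    for stats in leaf_stats:  # one entry per tree in the ensemble
        for i in range(n_labels):
            relevance[i] += stats["relevance"][i]
        count += stats["count"]
        cardinality += stats["cardinality"]
    k = round(cardinality / count) if count else 0
    ranked = sorted(range(n_labels), key=lambda i: -relevance[i])
    return set(ranked[:k])
```

Ranking by relevance sidesteps the exponential number of possible label sets: the tree never enumerates label sets, only scores individual labels and cuts the ranking at k.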

12 Experiment Setup
Three methods are compared:
- SMART (Stream Multi-lAbel Random Tree): multi-label stream classification with random trees [this paper]
- SMART (static): SMART without the fading function; keeps updating the trees without fading
- Multi-label kNN: a state-of-the-art multi-label classification method, combined with a sliding window
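For intuition about the sliding-window kNN baseline, here is a simplified stand-in: it assumes Euclidean distance and a majority vote over the k nearest neighbours in the window, whereas the actual ML-kNN method uses Bayesian inference over neighbour label counts.

```python
import math

def ml_knn_window(window, x, k=3):
    """Sliding-window multi-label kNN (simplified stand-in for ML-kNN):
    predict every label that appears in a majority of the k nearest
    neighbours currently held in the window.

    window -- list of (feature_vector, label_set) pairs (the recent past)
    x      -- feature vector of the incoming instance
    """
    neighbours = sorted(window, key=lambda item: math.dist(item[0], x))[:k]
    votes = {}
    for _, labels in neighbours:
        for label in labels:
            votes[label] = votes.get(label, 0) + 1
    return {label for label, v in votes.items() if v > k / 2}
```

The window size w (100/200/400 in the experiments) trades off adaptivity to concept drift against stability, which is why several values are compared.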

13 Data Sets
Three multi-label stream classification datasets:
- MediaMill: video annotation task, from the "MediaMill Challenge"
- TMC2007: text classification task, from the SDM text mining competition
- RCV1-v2: large-scale text classification task, from the Reuters dataset
(Statistics reported per dataset: # instances, # features, # labels, label density)

14 Evaluation
Multi-label metrics [Elisseeff & Weston, NIPS'02]:
- Ranking Loss ↓: evaluates the performance of the probability outputs; the average number of label pairs ranked incorrectly; the smaller the better
- Micro F1 ↑: evaluates the performance of the label-set prediction; considers the micro average of both precision and recall; the larger the better
Sequential evaluation with concept drifts, simulated by mixing two streams.
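Both metrics can be sketched directly from their definitions; counting tied scores as incorrectly ranked pairs is one common convention and an assumption here:

```python
def ranking_loss(scores, true_labels):
    """Fraction of (relevant, irrelevant) label pairs ranked incorrectly
    by the probability outputs (lower is better)."""
    relevant = [i for i in range(len(scores)) if i in true_labels]
    irrelevant = [i for i in range(len(scores)) if i not in true_labels]
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for r in relevant for s in irrelevant if scores[r] <= scores[s])
    return bad / (len(relevant) * len(irrelevant))

def micro_f1(predicted_sets, true_sets):
    """Micro-averaged F1 over a sequence of predicted/true label sets
    (higher is better): pools TP/FP/FN across all instances and labels,
    so it reflects both precision and recall."""
    tp = fp = fn = 0
    for pred, true in zip(predicted_sets, true_sets):
        tp += len(pred & true)
        fp += len(pred - true)
        fn += len(true - pred)
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```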

15 Throughput / Efficiency

16 Effectiveness: Ranking Loss (lower is better)
MediaMill dataset stream (x 4,300 instances). Methods compared: SMART (multi-label stream classification), SMART (static, without the fading function), and Multi-Label kNN with window sizes w = 100, 200, 400. Our approach with multi-label streaming random trees performed best on the MediaMill dataset.

17 Effectiveness: Micro F1 (higher is better)
MediaMill dataset stream (x 4,300 instances). Methods compared: SMART (multi-label stream classification), SMART (static, without the fading function), and Multi-Label kNN with window sizes w = 100, 200, 400.

18 Experiment Results
Ranking Loss and Micro F1 on the MediaMill, RCV1-v2, and TMC2007 datasets.

19 Experiment Results (continued)
Ranking Loss and Micro F1 on the MediaMill, RCV1-v2, and TMC2007 datasets.

20 Conclusions
An ensemble-based approach for fast classification of multi-label data streams:
- Ensemble-based approach: effective
- Predicts multiple labels
- Very fast in training (updating node statistics) and prediction using random trees: efficient
Thank you!

