Systematic Data Selection to Mine Concept-Drifting Data Streams. Wei Fan, IBM T.J. Watson.

About: Data streams are continuous streams of new data, generated either in real time or periodically. Examples: credit card transactions, stock trades, insurance claim data, phone call records. Our notation follows.

Data Streams [figure: a timeline of data arriving in periods t1 through t5; earlier periods form the old data, the latest period the new data]

Data Stream Mining. Characteristic: the underlying data may change over time. Main goal of stream mining: ensure that the constructed model is the most accurate and up-to-date.

Data Sufficiency. Definition: a dataset is considered sufficient if adding more data items does not significantly increase the final accuracy of a trained model. We normally do not know whether a dataset is sufficient. Sufficiency detection requires an expensive progressive-sampling experiment: keep adding data and stop when accuracy no longer increases significantly. The result depends on both the dataset and the algorithm, so it is difficult to make a general claim.
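
A minimal sketch of that progressive-sampling check, assuming `train` and `accuracy` callbacks supplied by the caller (both names, and the step and threshold defaults, are illustrative rather than from the paper):

```python
def sufficient_size(data, train, accuracy, step=1000, eps=0.001):
    """Progressive sampling: grow the training set until accuracy plateaus.

    `train` fits a model on a prefix of the data; `accuracy` scores it on
    held-out data. Returns the sample size at which more data stopped
    helping, or len(data) if accuracy never plateaued (possibly insufficient).
    """
    n, prev_acc = step, float("-inf")
    while n <= len(data):
        acc = accuracy(train(data[:n]))
        if acc - prev_acc < eps:      # no significant gain: looks sufficient
            return n
        prev_acc, n = acc, n + step
    return len(data)
```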

Possible changes of data streams. Possible concept drift: for the same feature vector, different class labels are generated at some later time, or, stochastically, the labels are generated with different probabilities. Possible changes in data sufficiency. Other possible changes are not addressed in our paper. Most important of all: these are only possibilities; there is no oracle to tell us the truth, and it is dangerous to make assumptions.

How many combinations? Four: sufficient and no drift, insufficient and no drift, sufficient and drift, insufficient and drift. Question: does the most accurate model remain the same under all four situations?

Case 1: Sufficient and no drift. Solution one: throw away the old models and data, and re-train a new model from the new data; this suffices by the definition of data sufficiency. Solution two: if the old model was trained from sufficient data, just keep using it.

Case 2: Sufficient and drift. Solution: train a new model from the new data; the same sufficiency argument applies.

Case 3: Insufficient and no drift. Possibility I: if the old model was trained from sufficient data, keep the old model. Possibility II: otherwise, combine the new data with the old data and train a new model.

Case 4: Insufficient and drift. Obviously, the new data alone is not enough, by definition. What are our options? Use old data? But how?

A moving hyperplane

See any problems? Which old data items can we use?

We need to be picky

Inconsistent Examples

Consistent Examples

See more problems? We normally never know which of the four cases a real data stream belongs to, and it may change from case to case over time. Normally, no truth is known a priori, or even later.

Solution requirements. The right solution should not be one-size-fits-all, and it should not make any assumptions, since any assumption can be wrong. It should be adaptive: let the data speak for itself. We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream. No assumptions!

An unbiased selection framework. Train FN from the new data. Train FN+ from the new data plus selected consistent old data. Let FO be the previously most accurate model; update FO using the new data and call the result FO+. Use cross-validation to choose among the four candidate models {FN, FN+, FO, FO+}, as sketched below.
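
A sketch of the framework in code; `train`, `update`, `is_consistent`, and `cv_accuracy` are assumed helpers, and judging old-data consistency with the model trained on new data reflects the "rely on optimal models" discussion that follows:

```python
def select_model(new_data, old_data, FO,
                 train, update, is_consistent, cv_accuracy):
    """Choose among {FN, FN+, FO, FO+} by cross-validation on the new data."""
    FN = train(new_data)                        # model from new data only
    consistent_old = [ex for ex in old_data     # old examples FN still agrees with
                      if is_consistent(FN, ex)]
    FN_plus = train(new_data + consistent_old)  # new data + consistent old data
    FO_plus = update(FO, new_data)              # previous best model, refreshed
    candidates = [FN, FN_plus, FO, FO_plus]
    # keep whichever candidate cross-validates best on the evolving stream
    return max(candidates, key=lambda m: cv_accuracy(m, new_data))
```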

Consistent old data. Theoretically, if we knew the true model, we could use it to choose consistent data. But we don't. Practically, we have to rely on optimal models. Go back to the hyperplane example.

A moving hyperplane

Their optimal models

True model and optimal models. The true model is a perfect model: it never makes mistakes. It is not always attainable, due to the stochastic nature of the problem, noise in the training data, or insufficient data. An optimal model is instead defined over a given loss function.

Optimal model. A loss function L(t, y) evaluates performance, where t is the true label and y is the prediction. The optimal decision y* is the label that minimizes the expected loss when x is sampled many times. Under 0-1 loss, y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud. Under cost-sensitive loss, y* is the label that minimizes the empirical risk: if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
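
A small sketch of the two rules (the function and argument names are illustrative; the $1000 benefit and $90 investigation cost are the slide's credit-card figures):

```python
def decide(p_fraud, loss="0-1", benefit=1000.0, cost=90.0):
    """Pick the label minimizing expected loss, given P(fraud|x)."""
    if loss == "0-1":
        # majority label: predict fraud iff it is the more probable class
        return "fraud" if p_fraud > 0.5 else "normal"
    # cost-sensitive: flag when the expected recovery exceeds the cost,
    # i.e. P(fraud|x) * $1000 > $90, equivalently P(fraud|x) > 0.09
    return "fraud" if p_fraud * benefit > cost else "normal"
```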

Random decision trees. Train multiple trees (details to follow). Each tree outputs a posterior probability when classifying an example x, and the probability outputs of the many trees are averaged as the final probability estimate. The loss function and the probability are then used to make the best prediction.

Training. At each node, an unused feature is chosen randomly. A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node. A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

Example [figure: a random tree splitting first on Gender (M/F), then on Age > 30 or Age > 25 along different paths; nodes store class counts such as P: 100, N: 150 and P: 1, N: 9]

Training, continued. We stop when one of the following happens: a node becomes empty, or the total height of the tree exceeds a threshold, currently set to the total number of features. Each node of the tree keeps the number of examples belonging to each class. A sketch of the whole procedure follows.
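
A minimal sketch of the training procedure, assuming examples are dicts with a "label" key and an `is_continuous` predicate over features; drawing the threshold from observed values is one illustrative reading of "a different threshold value is chosen":

```python
import random
from collections import Counter

def build_tree(examples, features, is_continuous, depth=0,
               max_depth=None, used=frozenset()):
    """Grow one random tree: at each node, pick an unused feature at random.

    A discrete feature is used at most once per root-to-leaf path; a
    continuous feature may recur, each time with a fresh random threshold.
    Every node stores per-class counts for probability estimation.
    """
    if max_depth is None:
        max_depth = len(features)       # height threshold = number of features
    node = {"counts": Counter(ex["label"] for ex in examples), "children": {}}
    if not examples or depth >= max_depth:   # stop: empty node or height limit
        return node
    candidates = [f for f in features if is_continuous(f) or f not in used]
    if not candidates:
        return node
    f = random.choice(candidates)
    node["feature"] = f
    if is_continuous(f):
        node["threshold"] = random.choice([ex[f] for ex in examples])
        splits = {"<=": [ex for ex in examples if ex[f] <= node["threshold"]],
                  ">":  [ex for ex in examples if ex[f] > node["threshold"]]}
    else:
        splits = {}
        for ex in examples:
            splits.setdefault(ex[f], []).append(ex)
        used = used | {f}               # this discrete feature is now used up
    for value, subset in splits.items():
        node["children"][value] = build_tree(subset, features, is_continuous,
                                             depth + 1, max_depth, used)
    return node
```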

Classification. Each tree outputs a membership probability p(fraud|x) = n_fraud / (n_fraud + n_normal). If a leaf node is empty (likely when a discrete feature is tested near the bottom of the tree), use the parent node's probability estimate rather than outputting 0 or NaN. The membership probabilities from the multiple random trees are averaged to approximate the final output. A loss function is required to make a decision: under 0-1 loss, predict fraud if p(fraud|x) > 0.5; under cost-sensitive loss, predict fraud if p(fraud|x) * $1000 > $90.
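
A sketch of classification over the trees grown above; the 0.5 threshold implements the 0-1 loss rule and 0.09 the cost-sensitive one:

```python
def tree_probability(node, x, target="fraud"):
    """Walk one tree and read P(target|x) from the leaf's class counts."""
    parent = None
    while "feature" in node:
        f = node["feature"]
        key = ("<=" if x[f] <= node["threshold"] else ">") \
              if "threshold" in node else x[f]
        child = node["children"].get(key)
        if child is None:             # branch value never seen in training
            break
        parent, node = node, child
    counts = node["counts"]
    if sum(counts.values()) == 0 and parent is not None:
        counts = parent["counts"]     # empty leaf: use the parent's estimate
    total = sum(counts.values())
    return counts.get(target, 0) / total if total else 0.0

def classify(trees, x, threshold=0.5, target="fraud"):
    """Average the per-tree probabilities, then apply the decision rule."""
    p = sum(tree_probability(t, x, target) for t in trees) / len(trees)
    return target if p > threshold else "normal"   # 0.5 for 0-1 loss,
                                                   # 0.09 for the $1000/$90 rule
```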

N-fold cross-validation with random decision trees. The tree structure is independent of the data, so cross-validation reduces to compensating the class counts when computing probabilities.

Key advantage: n-fold cross-validation comes easily, at the same cost as testing the model once on the training data. Training is efficient, since we do not compute information gain. It is actually also very accurate.
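
The slides only name the count "compensation", so the following is an assumed reading of the mechanics: because tree structure never depends on the data, scoring a training example as if held out only requires subtracting its own contribution from the class counts, with no re-training.

```python
def held_out_probability(node, x, x_label, target="fraud"):
    """Estimate P(target|x) with x itself excluded from the counts.

    The tree is never re-grown: held-out evaluation only adjusts counts,
    so n-fold (even leave-one-out) estimates cost about one pass over the
    training data.
    """
    while "feature" in node:
        f = node["feature"]
        key = ("<=" if x[f] <= node["threshold"] else ">") \
              if "threshold" in node else x[f]
        child = node["children"].get(key)
        if child is None:
            break
        node = child
    counts = dict(node["counts"])
    counts[x_label] = counts.get(x_label, 0) - 1   # compensate for x itself
    total = sum(counts.values())
    return counts.get(target, 0) / total if total else 0.0
```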

Experiments. I have a demo available to show; please contact me. In the paper I report experiments on synthetic datasets, credit card fraud datasets, and donation datasets.

Comparison. The new selective framework proposed in this paper versus last year's hard-coded ensemble framework, which uses K weighted ensembles: K=1 trains only on the new data; K=8 uses the new data plus the models of the previous 7 periods, with each classifier weighted against the new data. We test both sufficient and insufficient data; the concept always drifts.

Data insufficient: new method

Last year's method

Average result

Data sufficient: new method

Data sufficient: last year's method

Average result

Independent studies and implementations of random decision trees: Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets; Edward Greengrass from the DOD on their own datasets, with 100 to 300 features, both categorical and continuous, some features with many values, up to 3000 examples, and both binary and multi-class problems (16 and 25 classes).

Related publications on random trees: "Is random model better? On its accuracy and efficiency" (ICDM 2003); "On the optimality of probability estimation by random decision trees" (AAAI 2004); "Mining concept-drifting data streams using ensemble classifiers" (SIGKDD 2003).