Decision Tree Evolution Using a Limited Number of Labeled Data Items from Drifting Data Streams. Wei Fan (1), Yi-an Huang (2), and Philip S. Yu (1). (1) IBM T.J. Watson; (2) Georgia Tech.

Presentation transcript:

Decision Tree Evolution Using a Limited Number of Labeled Data Items from Drifting Data Streams. Wei Fan (1), Yi-an Huang (2), and Philip S. Yu (1). (1) IBM T.J. Watson; (2) Georgia Tech.

Sad life cycle of inductive models: Labeled Data -> Inductive Learner -> Inductive Model (decision trees, rules, naïve Bayes; e.g., credit card transaction -> {fraud, normal}). The model then makes predictions on un-labeled, real-time streaming data, where God knows the accuracy. When the true labels finally arrive: accuracy too low!

Seen any problems? Problem 1: we have no idea of the model's accuracy in the streaming environment. Problem 2: how long can we wait, and how much can we afford to lose, until we get labeled data?

Solutions. Solution I: error guessing and estimation. Idea 1: use observable statistical traits of the model itself to guess its error on unlabeled streaming data. Idea 2: use a very small number of specifically acquired labeled examples to statistically estimate the error, similar to using a poll of a small sample of voters to estimate whether Bush or Kerry will win the presidency. Details: "Active Mining of Data Streams" by Wei Fan, Yi-an Huang, and Philip S. Yu, SDM '04.
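Idea 2 can be sketched as a plain proportion estimate with a normal-approximation confidence interval, exactly the statistics behind election polls. This is an illustrative sketch, not the paper's exact estimator; all names are made up here:

```python
import math

def estimate_error_rate(predictions, true_labels, z=1.96):
    """Estimate a model's error rate from a small labeled sample, with a
    normal-approximation (Wald) confidence interval -- the same statistics
    used when polling a small sample of voters to predict an election."""
    n = len(true_labels)
    mistakes = sum(p != t for p, t in zip(predictions, true_labels))
    p_hat = mistakes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - margin), min(1.0, p_hat + margin))

# 30 mistakes among 100 sampled stream records: roughly 0.30 +/- 0.09 at 95%
err, (lo, hi) = estimate_error_rate([0] * 100, [0] * 70 + [1] * 30)
```

The point of the interval is that even a small acquired sample bounds how wrong the error guess can be.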

Solutions. Okay, assume we now know that our model's accuracy is too low. Obviously, we need a more accurate model. Solution II: update the model with a limited number of new training examples. We are interested in decision trees.

Decision Tree Example. The root tests A < 100; its yes-branch tests B < 50 and its no-branch tests C < 34. One leaf holds +: 100, -: 400, so P(+|x) = 0.2; another leaf holds +: 90, -: 10, so P(-|x) = 0.1.
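The example tree can be written down as a minimal node structure. This is a hypothetical sketch: the slide's B < 50 and C < 34 subtrees are collapsed into single leaves for brevity, and the class names and counts come from the slide's example.

```python
class Node:
    """Minimal decision-tree node: internal nodes test `feature < threshold`;
    leaves carry class counts from which P(+|x) is estimated."""
    def __init__(self, feature=None, threshold=None, counts=None):
        self.feature, self.threshold = feature, threshold
        self.counts = counts            # e.g. {'+': 100, '-': 400} at a leaf
        self.left = self.right = None   # yes / no branches

    def predict_proba(self, x):
        """Return the estimated P(+|x) at the leaf that x falls into."""
        if self.counts is not None:     # reached a leaf
            return self.counts['+'] / sum(self.counts.values())
        branch = self.left if x[self.feature] < self.threshold else self.right
        return branch.predict_proba(x)

# Root tests A < 100; leaf counts are taken from the slide.
root = Node(feature='A', threshold=100)
root.left = Node(counts={'+': 100, '-': 400})   # P(+|x) = 0.2
root.right = Node(counts={'+': 90, '-': 10})    # P(+|x) = 0.9, i.e. P(-|x) = 0.1
```

Storing raw class counts at leaves, rather than a hard label, is what makes the later replacement step possible.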

Class Distribution Replacement. If a node is considered suspicious by one of our detection techniques, we can perform class distribution replacement. The idea: keep the tree structure unchanged and re-estimate only that node's class distribution from the limited number of newly labeled examples.

Class Distribution Replacement. The tree structure (A < 100; B < 50 / C < 34) is left unchanged. Using a limited number of new examples, each suspicious leaf's distribution is re-estimated: the leaf that previously held +: 100, -: 400 (P(+|x) = 0.2) gets a new class distribution of P(+|x) = 0.4.
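The replacement itself is tiny: discard the leaf's old counts and recount from the freshly labeled examples that reach it. A minimal sketch with illustrative names, reproducing the slide's 0.2 -> 0.4 update:

```python
def replace_class_distribution(new_labels):
    """Re-estimate a leaf's class distribution from newly labeled examples,
    leaving the tree structure untouched."""
    counts = {'+': 0, '-': 0}
    for label in new_labels:
        counts[label] += 1
    return counts

old_counts = {'+': 100, '-': 400}                    # old P(+|x) = 0.2
new_counts = replace_class_distribution(['+'] * 4 + ['-'] * 6)
p_pos = new_counts['+'] / sum(new_counts.values())   # 0.4, as on the slide
```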

Some Statistics for the Significance Test. Proportion statistics: the formula is in the paper and in many statistics books. Assume a Gaussian distribution for the sample proportion and compute significance.
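One concrete instance of such a proportion statistic is the standard two-proportion z-test under the Gaussian approximation. This is the textbook test, shown as a sketch; the paper's exact formula may differ:

```python
import math

def proportions_differ(k_old, n_old, k_new, n_new, z_crit=1.96):
    """Two-proportion z-test: is the positive-class proportion at a leaf
    significantly different between the old training data (k_old/n_old)
    and the new labeled sample (k_new/n_new)? Uses a pooled proportion and
    the Gaussian approximation; z_crit = 1.96 gives a 5% two-sided level."""
    p_old, p_new = k_old / n_old, k_new / n_new
    p_pool = (k_old + k_new) / (n_old + n_new)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    return abs(p_old - p_new) / se > z_crit

# Old leaf: 100/500 positive (0.2). New sample: 40/100 positive (0.4).
changed = proportions_differ(100, 500, 40, 100)   # significant change
stable = proportions_differ(100, 500, 3, 10)      # too few examples to tell
```

Note how the same observed shift (0.2 vs. roughly 0.3-0.4) is significant with 100 new examples but not with 10, which is exactly why a statistical test is needed when labels are scarce.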

Leaf Expansion. Assume the significance test at a leaf fails. Solution: reconstruct the leaf, i.e., expand it into a subtree, using the limited number of new examples. Catch: this is not always possible. If the limited number of examples cannot justify an expansion, just keep the original node.
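The "expand only when justified" decision can be sketched as follows. This is illustrative only: a simple minimum-examples-per-branch rule stands in for the paper's statistical justification, and the single-feature purity scoring is a placeholder:

```python
def try_expand_leaf(examples, min_per_branch=5):
    """Attempt to expand a leaf into a split on one numeric feature.
    `examples` is a list of (value, label) pairs newly routed to the leaf.
    Returns the best split threshold, or None when no candidate split leaves
    at least `min_per_branch` examples on each side -- in that case the
    original node is kept unchanged."""
    def purity(part):            # fraction of the majority class in a branch
        pos = sum(1 for _, y in part if y == '+')
        return max(pos, len(part) - pos) / len(part)

    examples = sorted(examples)
    best_score, best_threshold = None, None
    for i in range(1, len(examples)):
        left, right = examples[:i], examples[i:]
        if len(left) < min_per_branch or len(right) < min_per_branch:
            continue             # too few examples to justify this split
        score = (len(left) * purity(left)
                 + len(right) * purity(right)) / len(examples)
        if best_score is None or score > best_score:
            best_score = score
            best_threshold = (examples[i - 1][0] + examples[i][0]) / 2
    return best_threshold

# 20 cleanly separated examples justify a split near 9.5 ...
split = try_expand_leaf([(float(v), '+') for v in range(10)] +
                        [(float(v), '-') for v in range(10, 20)])
# ... but a handful of examples cannot, so the original leaf is kept.
no_split = try_expand_leaf([(1.0, '+'), (2.0, '-'), (3.0, '+')])
```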

Result on Class Distribution Replacement

Result on Leaf Node Expansion

More results in the paper: the credit card fraud dataset and the UCI Adult dataset.

Conclusion. Pointed out the gap between data availability and pattern change. Proposed a general framework. Proposed a few methods to update and grow a decision tree from a limited number of examples.