Inductive Learning in Less Than One Sequential Data Scan. Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson); Shaw-hwa Lo (Columbia University).

Problems
Many inductive algorithms are main-memory based: when the dataset is larger than available memory they thrash, and efficiency drops sharply.
For algorithms that are not memory-based: do we really need to see every piece of data? Probably not. Tracing a full overfitting curve to find out is not practical.

Basic Idea: One-Scan Algorithm
[Diagram: the dataset is divided into sequential batches (Batch 1 through Batch 4); the algorithm trains a model from each batch.]

Loss and Benefit
Loss function: evaluates performance. The benefit matrix is the inverse of a loss function.
Traditional 0-1 loss: b[x, x] = 1 and b[x, y] = 0 for y ≠ x.
Cost-sensitive loss (credit card fraud, with a $90 overhead to investigate a transaction):
b[fraud, fraud] = $tranamt - $90
b[fraud, nonfraud] = $0
b[nonfraud, fraud] = -$90
b[nonfraud, nonfraud] = $0

Probabilistic Modeling
p(c|x) is the probability that x is an instance of class c.
e(c'|x) = Σ_c b[c, c'] · p(c|x) is the expected benefit of predicting class c' for x.
Optimal decision: predict the class with the highest expected benefit, argmax_{c'} e(c'|x).

Example
p(fraud|x) = 0.5 and tranamt = $200.
e(fraud|x) = b[fraud, fraud]·p(fraud|x) + b[nonfraud, fraud]·p(nonfraud|x) = (200 - 90) × 0.5 + (-90) × 0.5 = $10.
e(nonfraud|x) = b[fraud, nonfraud]·p(fraud|x) + b[nonfraud, nonfraud]·p(nonfraud|x) = 0 × 0.5 + 0 × 0.5 = $0 (always 0).
Predict fraud, since we expect to get $10 back.
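To make the decision rule concrete, here is a minimal Python sketch (illustrative only, not the authors' code); the benefit-matrix layout follows the slides above, and all function and variable names are placeholders of my own.

```python
# Expected-benefit decision rule with a cost-sensitive benefit matrix.
# b[true_label][predicted_label] is the benefit of predicting
# `predicted_label` when the true label is `true_label`.

def benefit_matrix(tranamt, overhead=90.0):
    return {
        "fraud":    {"fraud": tranamt - overhead, "nonfraud": 0.0},
        "nonfraud": {"fraud": -overhead,          "nonfraud": 0.0},
    }

def expected_benefits(probs, b):
    """probs: dict class -> p(class|x); returns dict predicted class -> e(class|x)."""
    return {
        pred: sum(b[true][pred] * p for true, p in probs.items())
        for pred in b
    }

def optimal_decision(probs, b):
    e = expected_benefits(probs, b)
    return max(e, key=e.get), e

# The worked example from the slide: p(fraud|x) = 0.5, tranamt = $200.
probs = {"fraud": 0.5, "nonfraud": 0.5}
decision, e = optimal_decision(probs, benefit_matrix(tranamt=200.0))
print(decision, e)   # fraud {'fraud': 10.0, 'nonfraud': 0.0}
```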

Combining Multiple Models
Individual benefits: each base model i estimates p_i(c|x), giving expected benefits e_i(c'|x).
Averaged benefits: E(c'|x) = (1/K) · Σ_i e_i(c'|x).
Optimal decision: predict the class with the highest averaged expected benefit, argmax_{c'} E(c'|x).
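A hedged sketch of the combination step, reusing expected_benefits from the previous snippet; ensemble_decision and the dict-of-probabilities model interface are my own assumptions for illustration.

```python
def ensemble_decision(models, x, b):
    """models: list of callables, each returning p(class|x) as a dict for input x."""
    k = len(models)
    avg = {}
    for model in models:
        e = expected_benefits(model(x), b)       # per-model expected benefits
        for label, value in e.items():
            avg[label] = avg.get(label, 0.0) + value / k
    return max(avg, key=avg.get), avg            # class with highest averaged benefit
```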

How about accuracy?

Do we need all K models?
We stop learning when the first k (< K) models already make the same predictions (hence have the same accuracy) as the full set of K models, with confidence p.
This ends up scanning less than one full pass of the dataset, using statistical sampling.

Less Than One Scan
[Diagram: batches (Batch 1 through Batch 4) are fed to the algorithm one at a time; after each batch the current ensemble is checked ("accurate enough?"); if yes, the model is output, otherwise training continues on the next batch.]

Hoeffding's Inequality
Consider a random variable bounded within a range of size R = a - b. After n observations its sample mean is y. With confidence p, and regardless of the underlying distribution, the true mean lies within y ± ε, where ε = sqrt( R² · ln(1/(1 - p)) / (2n) ).

When can we stop?
Using the current k models, let e₁ be the highest averaged expected benefit, with Hoeffding error ε₁, and e₂ the second-highest averaged expected benefit, with Hoeffding error ε₂.
The top-ranked label remains the winner with confidence p iff e₁ - ε₁ > e₂ + ε₂.
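A small sketch of this stopping test. It assumes the standard form of the Hoeffding bound (the exact constants in the paper may differ) and, for simplicity, uses a single shared error bound rather than the separate ε₁ and ε₂ above; names are illustrative.

```python
import math

def hoeffding_epsilon(value_range, n, confidence=0.997):
    """Hoeffding error bound for the mean of n observations of a variable
    bounded within `value_range`, at the given confidence level."""
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_stop(avg_benefits, value_range, n, confidence=0.997):
    """True if the best label is, with the given confidence, guaranteed to
    stay best even after adding the remaining models."""
    ranked = sorted(avg_benefits.values(), reverse=True)
    e1, e2 = ranked[0], ranked[1]
    eps = hoeffding_epsilon(value_range, n, confidence)
    # the slide allows separate error bounds per label; one shared bound is used here
    return (e1 - eps) > (e2 + eps)
```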

Less-Than-One-Scan Algorithm
Iterate this check over the instances of a validation set, and keep training on new batches until every instance receives the same prediction as the full ensemble of K models, with confidence p.

Validation Set
If the test fails on one example x, we do not need to examine any other example yet, so we only keep one example in memory at a time.
If the first k base models' prediction on x is already the same as that of the K models, it is very likely that k+1 models will also agree with the K models at the same confidence.

Validation Set
At any time we only need to keep one data item x from the validation set; items are read sequentially, and the validation set is read only once.
What can serve as a validation set? The training set itself, or a separate holdout set.

Amount of Data Scanned
Training set: at most one full scan. Validation set: exactly once.
When the training set doubles as the validation set: once we decide to train a model from a batch, we never use that batch for validation again. How much of the data is used to train models? Less than one full scan.
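Putting the pieces together, here is one possible end-to-end sketch of the less-than-one-scan loop as I read the slides (shown with a separate holdout validation set, and reusing ensemble_decision and can_stop from the earlier snippets); base_learner, batches, and validation are placeholder inputs, not the paper's API.

```python
def less_than_one_scan(base_learner, batches, validation, b, value_range,
                       K, confidence=0.997):
    """Train one model per batch; stop as soon as every validation example is,
    with the given confidence, predicted the same way the full K-model
    ensemble would predict it."""
    models = []
    val_iter = iter(validation)        # the validation set is read only once
    pending = next(val_iter, None)     # keep a single validation item in memory
    for batch in batches:
        if pending is None or len(models) >= K:
            break                      # all items passed, or full ensemble built
        models.append(base_learner(batch))   # train a model on this sequential batch
        # advance through validation items while the current k models are confident
        while pending is not None:
            _, avg = ensemble_decision(models, pending, b)
            if can_stop(avg, value_range, len(models), confidence):
                pending = next(val_iter, None)   # this item passes; fetch the next
            else:
                break                  # not confident yet: keep this item, train more
    return models
```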

Experiments
Donation dataset. Total benefit: the amount of charity donated minus the overhead of sending solicitations.

Experiment Setup
Inductive learners: C4.5, RIPPER, and naive Bayes (NB).
Number of base models: {8, 16, 32, 64, 128, 256}; we report the average over these settings.

Baseline Results (with C4.5)
Single model: $13292. Complete one scan: $14702 (the average over {8, 16, 32, 64, 128, 256} base models).
The one-scan ensemble is actually $1410 higher than the single model.

Less-Than-One Scan (with C4.5)
Full one scan: $14702. Less-than-one scan: $14828, actually a little higher (by $126).
How much data was scanned at 99.7% confidence? 71%.

Other Datasets
Credit card fraud detection. Total benefit: the recovered fraud amount minus the overhead of investigation.

Results
Baseline single model (with curtailed probability): $ . One-scan ensemble: $ . Less-than-one scan: $ .
Data scan amount: 64%.

Smoothing effect.

Related Work
Ensembles: meta-learning (Chan and Stolfo): 2 scans; bagging (Breiman) and AdaBoost (Freund and Schapire): multiple scans.
Use of Hoeffding's inequality: aggregate queries (Hellerstein et al.); streaming decision trees (Hulten and Domingos): a single decision tree, less than one scan.
Scalable decision trees: SPRINT (Shafer et al.): multiple scans; BOAT (Gehrke et al.): 2 scans.

Conclusion
Both the one-scan and the less-than-one-scan ensembles have accuracy similar to or higher than the single model.
The less-than-one-scan algorithm uses approximately 60% to 90% of the data for training without loss of accuracy.