# Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson); Shaw-hwa Lo (Columbia University)


Problems: Many inductive learning algorithms are main-memory-based. When the dataset is bigger than main memory, the system thrashes and efficiency becomes very low. For algorithms that are not memory-based: do we need to see every piece of data? Probably not. Tracing an overfitting curve to find out how much data is enough? Not practical.

Basic Idea: One Scan Algorithm. [Figure: the dataset is divided into Batch 1 through Batch 4; the algorithm is run on each batch to produce a model.]

Loss and Benefit: A loss function evaluates performance; the benefit matrix is the inverse of the loss function. Here b[x,y] is the benefit of predicting class y for an instance whose true class is x. Traditional 0-1 loss: b[x,x] = 1, b[x,y] = 0. Cost-sensitive loss, with an overhead of \$90 to investigate a fraud: b[fraud, fraud] = tranamt - \$90; b[fraud, nonfraud] = \$0; b[nonfraud, fraud] = -\$90; b[nonfraud, nonfraud] = \$0.

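Probabilistic Modeling: $p(\ell_i|x)$ is the probability that $x$ is an instance of class $\ell_i$. $e(\ell_i|x) = \sum_j b[\ell_j, \ell_i]\, p(\ell_j|x)$ is the expected benefit of predicting $\ell_i$. Optimal decision: $\ell^* = \arg\max_i e(\ell_i|x)$.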

Example: p(fraud|x) = 0.5 and tranamt = \$200. e(fraud|x) = b[fraud,fraud] p(fraud|x) + b[nonfraud,fraud] p(nonfraud|x) = (200 - 90) × 0.5 + (-90) × 0.5 = \$10. e(nonfraud|x) = b[fraud,nonfraud] p(fraud|x) + b[nonfraud,nonfraud] p(nonfraud|x) = 0 × 0.5 + 0 × 0.5 = \$0 (always 0). Predict fraud, since its expected benefit is \$10 rather than \$0.
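The arithmetic in this example can be checked with a short sketch. The function names (`benefit`, `expected_benefit`, `optimal_decision`) are illustrative and not from the slides; the benefit values and the \$90 overhead are the ones given in the Loss and Benefit slide.

```python
def benefit(true_label, predicted_label, tranamt, overhead=90.0):
    """Benefit b[true, predicted] for the fraud example on the slides."""
    if predicted_label == "fraud":
        # Investigating always costs the overhead; a real fraud recovers tranamt.
        return tranamt - overhead if true_label == "fraud" else -overhead
    # Predicting nonfraud: nothing is spent and nothing is recovered.
    return 0.0


def expected_benefit(predicted_label, probs, tranamt):
    """e(predicted|x) = sum over true labels of b[true, predicted] * p(true|x)."""
    return sum(benefit(t, predicted_label, tranamt) * p for t, p in probs.items())


def optimal_decision(probs, tranamt):
    """Pick the label with the highest expected benefit."""
    return max(probs, key=lambda label: expected_benefit(label, probs, tranamt))


probs = {"fraud": 0.5, "nonfraud": 0.5}
e_fraud = expected_benefit("fraud", probs, 200.0)
e_nonfraud = expected_benefit("nonfraud", probs, 200.0)
decision = optimal_decision(probs, 200.0)
print(e_fraud, e_nonfraud, decision)
```

With p(fraud|x) = 0.5 and a \$200 transaction, the sketch reproduces the \$10 versus \$0 comparison and picks fraud.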

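Combining Multiple Models: Individual benefits: $e_k(\ell_i|x) = \sum_j b[\ell_j, \ell_i]\, p_k(\ell_j|x)$ for each model $k$. Averaged benefits: $E_K(\ell_i|x) = \frac{1}{K} \sum_{k=1}^{K} e_k(\ell_i|x)$. Optimal decision: $L_K(x) = \arg\max_i E_K(\ell_i|x)$.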
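A minimal sketch of the averaging step, assuming each base model outputs a dictionary of class probabilities for x. The helper `averaged_benefit` and the three toy probability estimates are hypothetical; the 0-1 benefit matrix is the one from the Loss and Benefit slide.

```python
def averaged_benefit(per_model_probs, benefit, labels):
    """Average each label's expected benefit over all base models."""
    k = len(per_model_probs)
    averaged = {}
    for predicted in labels:
        total = 0.0
        for probs in per_model_probs:
            # Expected benefit of one model for this predicted label.
            total += sum(benefit(true, predicted) * probs[true] for true in labels)
        averaged[predicted] = total / k
    return averaged


def zero_one_benefit(true_label, predicted_label):
    """Traditional 0-1 benefit: b[x,x] = 1, b[x,y] = 0."""
    return 1.0 if true_label == predicted_label else 0.0


labels = ["fraud", "nonfraud"]
# Hypothetical probability estimates from three base models for one instance x.
per_model_probs = [
    {"fraud": 0.6, "nonfraud": 0.4},
    {"fraud": 0.3, "nonfraud": 0.7},
    {"fraud": 0.9, "nonfraud": 0.1},
]
avg = averaged_benefit(per_model_probs, zero_one_benefit, labels)
ensemble_decision = max(avg, key=avg.get)
print(avg, ensemble_decision)
```

Under 0-1 benefit the averaged benefit of a label is simply the mean of the models' probabilities for it, so one model disagreeing (0.3 here) does not flip the ensemble decision.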

Do we need all K models? We stop learning if k (< K) models have the same accuracy as K models with confidence p. We then end up scanning the dataset less than once. Statistical sampling is used to decide when to stop.

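Less than one scan. [Figure: the algorithm is run on Batch 1 through Batch 4 in turn; after each batch it checks "Accurate enough?"; if yes, the model is output; if no, the next batch is processed.]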

Hoeffding's inequality: Consider a random variable whose values lie within a range R = a - b. After n observations, its mean value is y. What is its error with confidence p, regardless of the distribution?
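The formula on this slide did not survive in the transcript. A commonly used form of the Hoeffding bound, assumed here rather than taken from the slide, is ε = sqrt(R² ln(1/(1 - p)) / (2n)): with confidence p, the true mean is within ε of the observed mean y. The function name `hoeffding_error` is illustrative.

```python
import math


def hoeffding_error(value_range, n, confidence):
    """Error bound for the mean of n observations with range R (assumed form)."""
    delta = 1.0 - confidence  # probability that the bound fails
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# The bound depends only on R, n, and p, not on the distribution.
eps_100 = hoeffding_error(1.0, 100, 0.997)
eps_400 = hoeffding_error(1.0, 400, 0.997)
print(eps_100, eps_400)
```

Because ε shrinks as 1/sqrt(n), four times as many observations halves the error bound.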

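When can we stop? Use k models. The highest expected benefit is $E_k(\ell_a|x)$, with Hoeffding's error $\epsilon_a$. The second highest expected benefit is $E_k(\ell_b|x)$, with Hoeffding's error $\epsilon_b$. The majority label is still $\ell_a$ with confidence p iff $E_k(\ell_a|x) - \epsilon_a > E_k(\ell_b|x) + \epsilon_b$.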
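The stopping test can be sketched as follows. This assumes the commonly used Hoeffding error form ε = sqrt(R² ln(1/(1 - p)) / (2n)) with n equal to the number of models k; the names `can_stop`, `avg_benefits`, and `ranges` are hypothetical and the numbers are toy values.

```python
import math


def hoeffding_error(value_range, n, confidence):
    """Assumed Hoeffding error form for a mean over n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - confidence)) / (2.0 * n))


def can_stop(avg_benefits, ranges, k, confidence):
    """True if the top label stays on top even after both error margins."""
    ranked = sorted(avg_benefits, key=avg_benefits.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    eps_best = hoeffding_error(ranges[best], k, confidence)
    eps_runner_up = hoeffding_error(ranges[runner_up], k, confidence)
    return avg_benefits[best] - eps_best > avg_benefits[runner_up] + eps_runner_up


# Toy values under 0-1 benefit, where each expected benefit lies in [0, 1].
avg_benefits = {"fraud": 0.9, "nonfraud": 0.1}
ranges = {"fraud": 1.0, "nonfraud": 1.0}
stop_with_8 = can_stop(avg_benefits, ranges, k=8, confidence=0.997)
stop_with_64 = can_stop(avg_benefits, ranges, k=64, confidence=0.997)
print(stop_with_8, stop_with_64)
```

With the same averaged benefits, 8 models leave error margins too wide to separate the two labels, while 64 models narrow them enough to stop.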

Less Than One Scan Algorithm: Iterate the process on every instance from a validation set, until every instance has the same prediction as the full ensemble with confidence p.
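One way to read this loop, as a rough sketch rather than the authors' implementation: validation instances are read one at a time, and a new base model is trained from the next unused batch whenever the current ensemble is not yet confident on the instance at hand. The callables `train` and `is_confident` are hypothetical placeholders, and the toy stopping rule below is only a stand-in for the Hoeffding-based test.

```python
def less_than_one_scan(batches, validation, train, is_confident):
    """Train base models batch by batch until every validation instance passes."""
    models = []
    batch_iter = iter(batches)
    for x in validation:  # the validation set is read sequentially, once
        # Stay on the same instance x until the ensemble is confident on it.
        while not (models and is_confident(models, x)):
            batch = next(batch_iter, None)
            if batch is None:
                # All batches used: this is the complete one-scan ensemble.
                return models
            models.append(train(batch))
    return models


# Toy stand-ins: a "model" is a batch mean, and the ensemble counts as
# confident once it holds at least two models.
batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
models = less_than_one_scan(
    batches,
    validation=[0, 1, 2],
    train=lambda batch: sum(batch) / len(batch),
    is_confident=lambda ms, x: len(ms) >= 2,
)
print(models)
```

In this toy run only two of the four batches are consumed, which is the sense in which the training data is scanned less than once.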

Validation Set: If we fail on one example x, we do not need to examine another one, so we can keep only one example in memory at a time. If the prediction of k base models on x is the same as that of K models, it is very likely that k+1 models will also agree with K models with the same confidence.

Validation Set: At any time, we only need to keep one data item x from the validation set. It is read sequentially from the validation set, and the validation set is read only once. What can be a validation set? The training set itself, or a separate holdout set.

Amount of Data Scan: Training set: at most once. Validation set: once. When the training set is used as the validation set: once we decide to train a model from a batch, we do not use that batch for validation again. How much data is used to train models? Less than one scan.

Experiments: Donation dataset. Total benefits: amount donated to charity minus the overhead of sending solicitations.

Experiment Setup: Inductive learners: C4.5, RIPPER, NB (naive Bayes). Number of base models: {8, 16, 32, 64, 128, 256}; the reported results are their average.

Baseline Results (with C4.5): Single model: \$13292.7. Complete one scan: \$14702.9 (the average over {8, 16, 32, 64, 128, 256} base models). The one-scan ensemble is actually \$1410 higher than the single model.

Less-than-one scan (with C4.5): Full one scan: \$14702. Less-than-one scan: \$14828, actually a little higher, by \$126. How much data is scanned with 99.7% confidence? 71%.

Other datasets: Credit card fraud detection. Total benefits: recovered fraud amount minus the overhead of investigation.

Results: Baseline single model: \$733980 (with curtailed probability). One scan ensemble: \$804964. Less than one scan: \$804914. Data scan amount: 64%.

Smoothing effect.

Related Work: Ensembles: meta-learning (Chan and Stolfo), 2 scans; bagging (Breiman) and AdaBoost (Freund and Schapire), multiple scans. Use of Hoeffding's inequality: aggregate queries (Hellerstein et al.); streaming decision tree (Hulten and Domingos), a single decision tree in less than one scan. Scalable decision trees: SPRINT (Shafer et al.), multiple scans; BOAT (Gehrke et al.), 2 scans.

Conclusion: Both one scan and less than one scan have accuracy either similar to or higher than the single model. Less than one scan uses approximately 60% to 90% of the data for training without loss of accuracy.
