A Framework for Scalable Cost-sensitive Learning Based on Combining Probabilities and Benefits Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Salvatore.

Presentation transcript:

A Framework for Scalable Cost-sensitive Learning Based on Combining Probabilities and Benefits
Wei Fan, Haixun Wang, and Philip S. Yu, IBM T.J.Watson
Salvatore J. Stolfo, Columbia University

Scalable Issues of Data Mining
- Twofold: the data and the algorithm.
- Dataset:
  - too big to fit into memory;
  - inherently distributed across the network;
  - incremental data becomes available periodically.

Scalable Issues of Data Mining
- Learning algorithm:
  - non-linear complexity in the dataset size n;
  - memory-based, due to the random access pattern of records in the dataset;
  - significantly slower if the dataset is not held entirely in memory.
- State of the art:
  - many scalable solutions are algorithm-specific (decision trees: SPRINT, RainForest, and BOAT);
  - general algorithms, such as meta-learning, are not very scalable and only work for cost-insensitive problems.
- Question: is there a general approach that works for both cost-sensitive and cost-insensitive problems?

Cost-sensitive Problems
- Charity donation: solicit people who will donate a large amount.
  - It costs $0.68 to send a letter.
  - E(x): expected donation amount.
  - Only solicit if E(x) > $0.68; otherwise money is lost.
- Credit card fraud detection: detect frauds with a high transaction amount.
  - It costs $90 to challenge a potential fraud.
  - E(x): expected fraudulent transaction amount.
  - Only challenge if E(x) > $90; otherwise money is lost.
- Question: how to estimate E(x) efficiently?

Basic Framework
A large dataset $D$ is partitioned into k subsets $D_1, D_2, \ldots, D_k$. A learner is run on each subset (labeled ML 1, ML 2, ..., ML t in the diagram) to generate k models $C_1, C_2, \ldots, C_k$.

Basic Framework
Each example in the test set is sent to the k models $C_1, C_2, \ldots, C_k$, which compute k predictions $P_1, P_2, \ldots, P_k$. These are combined into one prediction $P$.
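The two framework slides can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the "learner" is a stand-in that only estimates class frequencies (the experiments in the deck use C4.5), and the combiner is a plain average of the per-model probability estimates.

```python
import random
from collections import Counter

def partition(records, k, seed=0):
    """Shuffle the records and split them into k disjoint subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def train_frequency_model(subset):
    """Stand-in learner (assumption): class probabilities from label counts."""
    counts = Counter(label for _, label in subset)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def predict(model, x):
    """The toy model ignores x and returns its class-probability estimate."""
    return model

def combine(predictions):
    """Average the k per-model probability estimates into one prediction."""
    labels = set().union(*predictions)
    k = len(predictions)
    return {label: sum(p.get(label, 0.0) for p in predictions) / k
            for label in labels}

# Hypothetical data: 100 records, one in four labeled "pos".
records = [((i,), "pos" if i % 4 == 0 else "neg") for i in range(100)]
subsets = partition(records, k=4)
models = [train_frequency_model(s) for s in subsets]
combined = combine([predict(m, (7,)) for m in models])
```

Because each model only sees its own subset, the training step can run on k machines independently; only the small per-example predictions need to be brought together.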

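Cost-sensitive Decision Making
- Assume that $b[\ell_i, \ell_j]$ records the benefit received by predicting an example of class $\ell_i$ to be an instance of class $\ell_j$.
- The expected benefit received by predicting an example $x$ to be an instance of class $\ell_j$ (regardless of its true label) is $e(\ell_j \mid x) = \sum_{\ell_i} b[\ell_i, \ell_j] \cdot p(\ell_i \mid x)$.
- The optimal decision-making policy chooses the label that maximizes the expected benefit, i.e., $\ell_{\max} = \arg\max_{\ell_j} e(\ell_j \mid x)$.
- When $b[\ell_i, \ell_i] = 1$ and $b[\ell_i, \ell_j] = 0$ for $i \neq j$, this is a traditional accuracy-based problem.
- Total benefits: $\sum_{x} b[\ell(x), \ell_{\max}]$, where $\ell(x)$ is the true label of $x$.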
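A small sketch of the decision policy on this slide, assuming the benefit table is stored as a nested dictionary indexed as benefit[true_label][predicted_label]. The function names and the dollar figures in the example (a $1,000 transaction with a $90 challenge overhead) are hypothetical and chosen only for illustration.

```python
def expected_benefit(benefit, probs, predicted):
    """Sum over true labels of benefit[true][predicted] * p(true | x)."""
    return sum(benefit[true][predicted] * p for true, p in probs.items())

def best_label(benefit, probs):
    """Pick the label with the highest expected benefit."""
    return max(probs, key=lambda label: expected_benefit(benefit, probs, label))

# Hypothetical benefit table: catching a $1000 fraud nets 1000 - 90,
# challenging a legitimate transaction loses the $90 overhead.
fraud_benefit = {
    "fraud": {"fraud": 910.0, "ok": 0.0},
    "ok": {"fraud": -90.0, "ok": 0.0},
}
# The 0-1 table recovers the ordinary accuracy-based decision.
zero_one = {
    "fraud": {"fraud": 1.0, "ok": 0.0},
    "ok": {"fraud": 0.0, "ok": 1.0},
}
probs = {"fraud": 0.2, "ok": 0.8}

cost_sensitive_choice = best_label(fraud_benefit, probs)
accuracy_based_choice = best_label(zero_one, probs)
```

With these numbers the cost-sensitive policy challenges the transaction even though fraud is the less likely class, while the 0-1 table simply picks the most probable label.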

Charity Donation Example
- It costs $0.68 to send a solicitation.
- Assume that $y(x)$ is the best estimate of the donation amount.
- The cost-sensitive decision-making policy will solicit an individual $x$ if and only if $p(\text{donate} \mid x) \cdot y(x) > 0.68$.

Credit Card Fraud Detection Example
- It costs $90 to challenge a potential fraud.
- Assume that $y(x)$ is the transaction amount.
- The cost-sensitive decision-making policy will predict a transaction $x$ to be fraudulent if and only if $p(\text{fraud} \mid x) \cdot y(x) > 90$.
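The charity and fraud examples share one shape: act only when the expected amount exceeds the fixed overhead. The sketch below assumes the expected amount E(x) is estimated as the class probability times the amount y(x); the function name and the sample probabilities and amounts are hypothetical.

```python
def should_act(probability, amount, overhead):
    """Act only if the expected amount (probability * amount) beats the overhead."""
    return probability * amount > overhead

# Charity donation: $0.68 per solicitation letter.
solicit_likely_donor = should_act(probability=0.05, amount=20.0, overhead=0.68)
solicit_unlikely_donor = should_act(probability=0.01, amount=20.0, overhead=0.68)

# Credit card fraud: $90 to challenge a transaction.
challenge_small = should_act(probability=0.5, amount=100.0, overhead=90.0)
challenge_large = should_act(probability=0.5, amount=200.0, overhead=90.0)
```

Note that the same fraud probability leads to different decisions depending on the transaction amount, which is what separates this setting from plain accuracy-based classification.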

Adult Dataset
- Downloaded from the UCI repository.
- Associate a benefit factor of 2 with positives and a benefit factor of 1 with negatives.
- The decision to predict positive is $2 \cdot p(\text{positive} \mid x) > 1 \cdot p(\text{negative} \mid x)$.

Calculating probabilities
- For decision trees, if $n$ is the number of examples in a node and $k$ is the number of examples with class label $\ell$, then the probability is $p(\ell \mid x) = k/n$. More sophisticated methods: smoothing, early stopping, and early stopping plus smoothing.
- For rules, the probability is calculated in the same way as for decision trees.
- For naive Bayes, if $s(\ell)$ is the score for class label $\ell$, the probability is computed from that score. Alternative: binning.
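The decision-tree estimate can be sketched as below. The raw ratio k/n follows the slide's description; the smoothed variant uses the Laplace correction, which is one common smoothing choice and is an assumption here, since the slide does not say which smoothing formula was used. Function names are hypothetical.

```python
def leaf_probability(k, n):
    """Raw estimate: fraction of the n examples in the leaf that have the label."""
    if n <= 0:
        raise ValueError("a leaf must contain at least one example")
    return k / n

def laplace_smoothed(k, n, num_classes):
    """Laplace correction (assumed smoothing choice): pulls estimates from
    small leaves toward the uniform distribution over the classes."""
    return (k + 1) / (n + num_classes)

raw = leaf_probability(k=3, n=4)
smoothed = laplace_smoothed(k=3, n=4, num_classes=2)
pure_small_leaf = laplace_smoothed(k=2, n=2, num_classes=2)
```

Smoothing matters for cost-sensitive decisions because a pure leaf with only two examples would otherwise report a probability of exactly 1, which overstates the expected benefit.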

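Combining Technique: Averaging
- Each model $C_m$ computes an expected benefit $e_m(\ell_j \mid x)$ for example $x$ over every class label $\ell_j$.
- The individual expected benefits of the k models are combined by averaging: $E_k(\ell_j \mid x) = \frac{1}{k} \sum_{m=1}^{k} e_m(\ell_j \mid x)$.
- We choose the label with the highest combined expected benefit: $\ell_{\max} = \arg\max_{\ell_j} E_k(\ell_j \mid x)$.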
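A minimal sketch of the averaging combiner, assuming each base model reports its expected benefit for one example as a dictionary keyed by class label. The function names and the benefit values for the three hypothetical models are illustrative only.

```python
def average_expected_benefit(per_model_benefits):
    """Average each label's expected benefit across all base models."""
    k = len(per_model_benefits)
    labels = per_model_benefits[0].keys()
    return {label: sum(m[label] for m in per_model_benefits) / k
            for label in labels}

def choose_label(per_model_benefits):
    """Pick the label with the highest combined expected benefit."""
    combined = average_expected_benefit(per_model_benefits)
    return max(combined, key=combined.get)

# Three hypothetical models disagree on whether soliciting pays off.
per_model = [
    {"solicit": 1.2, "skip": 0.0},
    {"solicit": -0.5, "skip": 0.0},
    {"solicit": 0.2, "skip": 0.0},
]
combined = average_expected_benefit(per_model)
decision = choose_label(per_model)
```

Averaging needs no extra training pass over the data, which is part of why it adds so little overhead compared with regression or meta-learning combiners.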

Why accuracy is higher?
1. Decision threshold line.
2. Examples on the left are more profitable than those on the right.
3. "Evening effect": biases towards big fish.

More sophisticated combining approaches
- Regression: treat the base classifiers' outputs as the independent variables of a regression and the true label as the dependent variable.
- Modified meta-learning: learn a classifier that maps the base classifiers' class label predictions to the true class label. For cost-sensitive learning, the top-level classifier outputs a probability instead of just a label.

Experiments
- Learner: C4.5 version 8
- Datasets: Donation (KDD98), Credit Card, Adult
- Number of partitions: 8, 16, 32, 64, 128, and 256

Accuracy comparison

Accuracy comparison

Detailed Spread

Credit Card Fraud Dataset

Adult Dataset

Why accuracy is higher?

Scalability Analysis of Averaging Method
- Baseline: a single model that is computed from the entire dataset as a whole.
- Our approach: an ensemble of multiple models, each of which is computed from a disjoint subset of the data.

Scalability Analysis
- Serial improvement
- Parallel improvement
- Speedup
- Scaled speedup

Scalability Results - Serial Improvement

Scalability Results - Parallel Improvement

Scalability Results - Speedup

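Fully distributed learning framework
The data resides at k sites as $D_1, D_2, \ldots, D_k$. A learner is run at each site (labeled ML 1, ML 2, ..., ML t in the diagram) to generate k models $C_1, C_2, \ldots, C_k$.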

Communication overhead

Overhead analysis

Summary and Future Work
- Evaluated a wide range of combining techniques, including variations of averaging, regression, and meta-learning, for scalable cost-sensitive (and cost-insensitive) learning.
- Averaging, although simple, has the highest accuracy.
- Previously proposed approaches have significantly more overhead and only work well for traditional accuracy-based problems.
- Future work: ensemble pruning and performance estimation.

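Accuracy-based Problems (0-1 loss)
- Suppose that $p(\ell_i \mid x)$ is the probability that $x$ is an instance of class label $\ell_i$.
- An inductive model will always predict the label with the highest probability, i.e., $\ell_{\max} = \arg\max_{\ell_i} p(\ell_i \mid x)$.
- The accuracy of a method on a dataset $S$ is the fraction of examples $x \in S$ whose predicted label $\ell_{\max}$ equals the true label.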
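For comparison with the cost-sensitive case, the accuracy-based decision and its evaluation can be sketched as follows. This assumes accuracy is the fraction of examples whose most probable label matches the true label; the function names and the three sample predictions are hypothetical.

```python
def predict_label(probs):
    """Accuracy-based decision: the label with the highest probability."""
    return max(probs, key=probs.get)

def accuracy(prob_rows, true_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(predict_label(p) == t for p, t in zip(prob_rows, true_labels))
    return correct / len(true_labels)

# Hypothetical probability estimates for three examples.
prob_rows = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"a": 0.7, "b": 0.3},
]
true_labels = ["a", "b", "b"]
score = accuracy(prob_rows, true_labels)
```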