Is Random Model Better? -On its accuracy and efficiency-


Wei Fan, IBM T.J.Watson
Joint work with Haixun Wang, Philip S. Yu, and Sheng Ma

Optimal Model
- A loss function L(t, y) evaluates performance: t is the true label and y is the prediction.
- The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly.
- 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
- Cost-sensitive loss: y* is the label that minimizes the "empirical risk". If P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
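A minimal sketch of the two decision rules above, in Python. The $1000 transaction value and $90 overhead are the figures quoted on the slide; the function names and defaults are mine, not the paper's:

```python
def predict_01(p_fraud):
    # 0-1 loss: pick the most probable label.
    return "fraud" if p_fraud > 0.5 else "normal"

def predict_cost_sensitive(p_fraud, amount=1000.0, overhead=90.0):
    # Cost-sensitive loss: flag the transaction when the expected recovery
    # P(fraud|x) * amount exceeds the investigation overhead.
    # With amount = $1000 and overhead = $90 this is P(fraud|x) > 0.09.
    return "fraud" if p_fraud * amount > overhead else "normal"

print(predict_01(0.6), predict_01(0.9))                             # same decision: fraud fraud
print(predict_cost_sensitive(0.091), predict_cost_sensitive(1.0))   # same decision: fraud fraud
```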

How do we look for the optimal model?
- Finding it is NP-hard for most "model representations".
- We assume that the simplest hypothesis that fits the data is the best.
- We employ all kinds of heuristics to look for it: info gain, gini index, Kearns-Mansour, etc.
- Pruning: MDL pruning, reduced-error pruning, cost-based pruning.
- Reality: tractable, but still pretty expensive.

On the other hand
- Occam's Razor's interpretation: given two hypotheses with the same loss, we should prefer the simpler one.
- Yet very complicated hypotheses can be highly accurate: meta-learning, boosting (weighted voting), bagging (sampling with replacement).
- Where are we? The above are very complicated to compute.
- Question: do we have to?

Do we have to be “perfect”?
- 0-1 loss, binary problem: if P(positive|x) > 0.5, we predict x to be positive. Whether P(positive|x) = 0.6 or P(positive|x) = 0.9 makes no difference to the final prediction!
- Cost-sensitive problems: if P(fraud|x) * $1000 > $90, we predict x to be fraud. Rewritten: P(fraud|x) > 0.09. Whether P(fraud|x) = 1.0 or P(fraud|x) = 0.091 makes no difference.

Random Decision Tree
- Build several empty iso-depth tree structures without even looking at the data.
- Each example is sorted through a unique path from the root to a leaf.
- Each tree node records the number of training instances belonging to each class.
- Update the empty nodes by scanning the data set only once; it is like "classifying" the data: when an example reaches a node, the count for its class label is incremented.
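A minimal Python sketch of this construction, under my own simplifying assumptions (numeric features scaled to [0, 1], binary splits on a random threshold, split features drawn uniformly at random); it is an illustration of the idea, not the paper's exact algorithm:

```python
import random
from collections import Counter

class Node:
    def __init__(self, depth, max_depth, n_features):
        # Counts of training examples of each class that reach this node.
        self.counts = Counter()
        self.children = None
        if depth < max_depth:
            # Pick a split feature and threshold at random -- the data is never inspected.
            self.feature = random.randrange(n_features)
            self.threshold = random.random()
            self.children = [Node(depth + 1, max_depth, n_features) for _ in range(2)]

    def route(self, x):
        # Return the child an example falls into.
        return self.children[0] if x[self.feature] <= self.threshold else self.children[1]

def build_random_trees(n_trees, n_features, max_depth):
    # The tree structures are created before any data is seen.
    return [Node(0, max_depth, n_features) for _ in range(n_trees)]

def update_with_data(trees, X, y):
    # One scan of the data: each example walks down every tree,
    # incrementing the class count at every node it passes.
    for xi, yi in zip(X, y):
        for root in trees:
            node = root
            node.counts[yi] += 1
            while node.children is not None:
                node = node.route(xi)
                node.counts[yi] += 1
```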

Classification
- Each tree outputs a membership probability: p(fraud|x) = n_fraud / (n_fraud + n_normal).
- The membership probabilities from multiple random trees are averaged to approximate the true probability.
- A loss function is required to make a decision:
  - 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
  - cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
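A standalone sketch of the prediction step (helper names are mine): each tree contributes the class frequency at the leaf the example reaches, the frequencies are averaged, and only then is the loss function applied:

```python
def tree_probability(n_fraud, n_normal):
    # A single tree's membership probability at the leaf the example reached.
    total = n_fraud + n_normal
    return n_fraud / total if total > 0 else 0.5  # empty leaf -> uninformative 0.5

def averaged_probability(leaf_counts):
    # leaf_counts: one (n_fraud, n_normal) pair per tree, taken from the leaf
    # that the example falls into in that tree.
    probs = [tree_probability(nf, nn) for nf, nn in leaf_counts]
    return sum(probs) / len(probs)

# Example with three hypothetical trees:
p = averaged_probability([(9, 1), (3, 7), (6, 4)])
print(p)                                          # approximately 0.6
print("fraud" if p > 0.5 else "normal")           # 0-1 loss
print("fraud" if p * 1000 > 90 else "normal")     # cost-sensitive loss
```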

Tree Depth
- Chosen to create diversity: half of the number of features.
- The number of feature combinations peaks at half the size of the population; for example, choosing 2 out of 4 gives 6 choices.
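A quick check of that binomial-coefficient claim (plain Python, nothing specific to the paper):

```python
from math import comb

n = 4
print([comb(n, k) for k in range(n + 1)])  # [1, 4, 6, 4, 1] -- peaks at k = n/2 = 2
```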

Number of Trees
- Sampling theory: 30 trees give a pretty good estimate with reasonably small variance; 10 is usually already in that range.
- Worst scenario: only one feature is relevant and all the rest are noise. Probability:
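The probability expression on the original slide is not reproduced in this transcript. Purely as an illustration of the kind of argument involved (my assumption, not the slide's formula): if each tree of depth k/2 uses about half of the k features, a single relevant feature appears in any one tree with probability roughly 1/2, so the chance that none of N trees sees it shrinks as (1/2)^N:

```python
# Assumed model: each tree independently picks up the one relevant feature
# with probability 0.5; this is an illustration, not the slide's derivation.
for n_trees in (10, 30):
    p_at_least_one = 1 - 0.5 ** n_trees
    print(n_trees, p_at_least_one)  # 10 -> ~0.999, 30 -> ~0.999999999
```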

Simple Feature Info Gain
- Limitation: at least one feature with info gain by itself.
- Same limitation as C4.5 and dti.
- Example.

Training Efficiency
- One complete scan of the training data.
- Memory requirement: hold one tree (or, better, multiple trees) in memory; one example is read at a time.

Donation Dataset
- Decide to whom to send a charity solicitation letter.
- It costs $0.68 to send a letter.
- Loss function:
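The loss matrix itself does not appear in the transcript. For orientation only, the usual formulation for this benchmark (stated here as an assumption, not quoted from the slide) is to mail a person when the expected donation exceeds the mailing cost:

```python
def send_letter(p_donate, predicted_amount, cost=0.68):
    # Mail only if the expected return P(donate|x) * predicted donation amount
    # exceeds the $0.68 cost of sending the letter.
    return p_donate * predicted_amount > cost

print(send_letter(0.05, 20.0))  # True:  expected $1.00 > $0.68
print(send_letter(0.02, 20.0))  # False: expected $0.40 < $0.68
```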

Result

Result

Credit Card Fraud
- Detect whether a transaction is a fraud.
- There is an overhead to detect a fraud, taken from {$60, $70, $80, $90}.
- Loss function:
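Again the loss matrix is omitted from the transcript. A hedged sketch of the per-transaction rule implied by the earlier slides (investigate when the expected recovery beats the overhead; the function name and example numbers are mine):

```python
def investigate(p_fraud, transaction_amount, overhead):
    # Flag the transaction when the expected recovered amount
    # P(fraud|x) * amount exceeds the investigation overhead.
    return p_fraud * transaction_amount > overhead

for overhead in (60, 70, 80, 90):
    print(overhead, investigate(p_fraud=0.085, transaction_amount=1000, overhead=overhead))
# Expected recovery is $85: True for overheads $60-$80, False for $90.
```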

Result

Extreme situation

Tree Depth

Compare with Boosting

Compare with Bagging

Conclusion
- Points out that conventional inductive learning (a single best model, or an ensemble of complicated models) is probably far more complicated than necessary.
- Proposes a very efficient and accurate random decision tree algorithm.