Download presentation

Presentation is loading. Please wait.

Published byJackson Clarke Modified over 2 years ago

1
Is Random Model Better? - On its accuracy and efficiency - Wei Fan IBM T.J.Watson Joint work with Haixun Wang, Philip S. Yu, and Sheng Ma

2
Optimal Model Loss function L(t,y) to evaluate performance. t is true label and y is prediction Optimal decision decision y* is the label that minimizes expected loss when x is sampled repeatedly: 0-1 loss: y* is the label that appears the most often, i.e., if P(fraud|x) > 0.5, predict fraud cost-sensitive loss: the label that minimizes the empirical risk. If P(fraud|x) * $1000 > $90 or p(fraud|x) > 0.09, predict fraud

3
How we look for optimal model? NP-hard for most model representation We think that simplest hypothesis that fits the data is the best. We employ all kinds of heuristics to look for it. info gain, gini index, Kearns-Mansour, etc pruning: MDL pruning, reduced error-pruning, cost-based pruning. Reality: tractable, but still pretty expensive

4
On the other hand Occam s Razor s interpretation: two hypotheses with the same loss, we should prefer the simpler one. Very complicated hypotheses that are highly accurate: Meta-learning Boosting (weighted voting) Bagging (sampling without replacement) Where are we? The above are very complicated to compute. Question: do we have to?

5
Do we have to be perfect ? 0-1 loss binary problem: P(positive|x) > 0.5, we predict x to be positive. P(positive|x) = 0.6, P(positive|x) = 0.9 makes no difference in final prediction! Cost-sensitive problems: P(fraud|x) * $1000 > $90, we predict x to be fraud. Re-write it P(fraud|x) > 0.09 P(fraud|x) = 1.0 and P(fraud|x) = makes no difference.

6
Random Decision Tree Building several empty iso-depth tree structures without even looking at the data. Example is sorted through a unique path from the root the the leaf. Each tree node records the number of instances belonging to each class. Update each empty node by scanning the data set only once. It is like classifying the data. When an example reaches a node, the number of examples belonging to a particular class label increments

7
Classification Each tree outputs membership probability p(fraud|x) = n_fraud/(n_fraud + n_normal) The membership probability from multiple random trees are averaged to approximate the true probability Loss function is required to make a decision 0-1 loss: p(fraud|x) > 0.5, predict fraud cost-sensitive loss: p(fraud|x) $1000 > $90

8
Tree depth To create diversity Half of the number of features Combinations peak at half the size of population Such as, combine 2 out 4 gives 6 choices.

9
Number of trees Sampling theory: 30 gives pretty good estimate with reasonably small variance 10 is usually already in the range. Worst scenario Only one feature is relevant. All the rest are noise. Probability:

10
Simple Feature Info Gain Limitation: At least one feature with info gain by itself Same limitation as C4.5 and dti Example

11
Training Efficiency One complete scan of the training data. Memory Requirement: Hold one tree (or better multiple trees) One example read at a time.

12
Donation Dataset Decide whom to send charity solicitation letter. It costs $0.68 to send a letter. Loss function

13
Result

14

15
Credit Card Fraud Detect if a transaction is a fraud There is an overhead to detect a fraud, {$60, $70, $80, $90} Loss Function

16
Result

17
Extreme situation

18
Tree Depth

19
Compare with Boosting

20
Compare with Bagging

21
Conclusion Point out the reality that conventional inductive learning (single best and multiple complicated) are probably way too complicated beyond necessity Propose a very efficient and accurate random tree algorithm

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google