Presentation on theme: "Is Random Model Better? -On its accuracy and efficiency-"— Presentation transcript:
1 Is Random Model Better? -On its accuracy and efficiency- Wei Fan, IBM T.J. Watson. Joint work with Haixun Wang, Philip S. Yu, and Sheng Ma
2 Optimal Model A loss function L(t,y) evaluates performance, where t is the true label and y is the prediction. Optimal decision: y* is the label that minimizes the expected loss when x is sampled repeatedly. 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud. Cost-sensitive loss: y* is the label that minimizes the "empirical risk"; if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
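A minimal sketch (not from the slides) of the two decision rules just described, assuming a model that supplies an estimate of P(fraud|x); the $1000 benefit and $90 cost figures are the ones on the slide:

    def decide_01(p_fraud):
        # 0-1 loss: predict the most probable label
        return "fraud" if p_fraud > 0.5 else "normal"

    def decide_cost_sensitive(p_fraud, gain=1000.0, cost=90.0):
        # cost-sensitive loss: predict fraud when the expected benefit exceeds the cost,
        # i.e. P(fraud|x) * $1000 > $90, equivalently P(fraud|x) > 0.09
        return "fraud" if p_fraud * gain > cost else "normal"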
3 How do we look for the optimal model? Finding it is NP-hard for most "model representations". We assume that the simplest hypothesis that fits the data is the best, and we employ all kinds of heuristics to look for it: info gain, gini index, Kearns-Mansour, etc., plus pruning methods such as MDL pruning, reduced-error pruning, and cost-based pruning. Reality: tractable, but still pretty expensive.
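For reference, a hedged sketch of two of the split heuristics named above (these are textbook formulas, not the authors' code):

    import math
    from collections import Counter

    def entropy(labels):
        # entropy of a list of class labels
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # gini index of a list of class labels
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def info_gain(parent_labels, children_label_lists):
        # reduction in entropy achieved by a candidate split
        n = len(parent_labels)
        weighted = sum(len(ch) / n * entropy(ch) for ch in children_label_lists)
        return entropy(parent_labels) - weighted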
4 On the other hand, Occam's Razor's interpretation: given two hypotheses with the same loss, we should prefer the simpler one. Yet highly accurate hypotheses are often very complicated: meta-learning, boosting (weighted voting), bagging (sampling with replacement). Where are we? The above are very complicated to compute. Question: do we have to?
5 Do we have to be "perfect"? 0-1 loss binary problem: if P(positive|x) > 0.5, we predict x to be positive, so P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference to the final prediction! Cost-sensitive problems: if P(fraud|x) * $1000 > $90, we predict x to be fraud; rewriting, P(fraud|x) > 0.09, so any estimate above 0.09, whether 0.1 or 1.0, makes no difference to the decision.
6 Random Decision Tree Build several empty iso-depth tree structures without even looking at the data. Each example is sorted through a unique path from the root to a leaf, and each tree node records the number of instances belonging to each class. Update the empty nodes by scanning the data set only once; it is like "classifying" the data: when an example reaches a node, the count for that example's class label is incremented.
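A simplified sketch of this idea, assuming binary (0/1) features so that each internal node just tests one randomly chosen feature; the structure is fixed before any data is seen, and a single scan fills in the class counts:

    import random
    from collections import defaultdict

    class Node:
        def __init__(self, depth, max_depth, n_features):
            self.counts = defaultdict(int)      # class label -> count at this node
            if depth < max_depth:
                self.feature = random.randrange(n_features)
                self.children = {0: Node(depth + 1, max_depth, n_features),
                                 1: Node(depth + 1, max_depth, n_features)}
            else:
                self.children = None            # leaf

        def update(self, x, label):
            # every node on the example's path records its class label
            self.counts[label] += 1
            if self.children is not None:
                self.children[x[self.feature]].update(x, label)

    def build_random_trees(n_trees, n_features, max_depth):
        # tree structures are created without looking at the data
        return [Node(0, max_depth, n_features) for _ in range(n_trees)]

    def train(trees, dataset):
        # one complete scan of (x, label) pairs updates every tree
        for x, label in dataset:
            for tree in trees:
                tree.update(x, label)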
7 Classification Each tree outputs a membership probability, p(fraud|x) = n_fraud / (n_fraud + n_normal). The membership probabilities from multiple random trees are averaged to approximate the true probability. A loss function is then required to make a decision: 0-1 loss: if p(fraud|x) > 0.5, predict fraud; cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
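Continuing the sketch above: each tree reports a membership probability from its leaf counts, the trees are averaged, and the slide's cost-sensitive rule turns the averaged probability into a decision.

    def tree_probability(node, x, positive="fraud"):
        # walk to the leaf and return n_fraud / (n_fraud + n_normal) there
        while node.children is not None:
            node = node.children[x[node.feature]]
        total = sum(node.counts.values())
        return node.counts[positive] / total if total else 0.5

    def predict_probability(trees, x):
        # average the membership probabilities of all random trees
        return sum(tree_probability(t, x) for t in trees) / len(trees)

    def predict(trees, x, gain=1000.0, cost=90.0):
        p = predict_probability(trees, x)
        return "fraud" if p * gain > cost else "normal"   # cost-sensitive rule from the slide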
8 Tree depth To create diversity, use half of the number of features: the number of feature combinations peaks at half the size of the feature set. For example, combining 2 out of 4 features gives 6 choices.
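A one-line check of the slide's example, using Python's binomial-coefficient helper:

    from math import comb

    # C(m, k) peaks at k = m/2, which is why depth is set to half the feature count;
    # e.g. choosing 2 of 4 features gives comb(4, 2) == 6 combinations.
    print([comb(4, k) for k in range(5)])   # [1, 4, 6, 4, 1] -> maximum at k = 2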
9 Number of trees Sampling theory: 30 trees give a pretty good estimate with reasonably small variance, and 10 is usually already in the range. Worst scenario: only one feature is relevant and all the rest are noise. Probability:
10 Simple Feature Info Gain Limitation: there must be at least one feature with information gain by itself, the same limitation as C4.5 and dti.
11 Training Efficiency One complete scan of the training data. Memory requirement: hold one tree (or, better, multiple trees) in memory; one example is read at a time.
12 Donation Dataset Decide to whom to send a charity solicitation letter; it costs $0.68 to send a letter. Loss function: mail the letter only if the expected donation from the person exceeds the $0.68 mailing cost.
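A minimal sketch of the mailing decision implied by this slide; expected_donation here is a hypothetical stand-in for whatever the model estimates (e.g. P(donate|x) times a predicted donation amount), not a quantity defined on the slide:

    def should_mail(expected_donation, mailing_cost=0.68):
        # mail only when the expected return exceeds the fixed cost of sending a letter
        return expected_donation > mailing_cost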
21 Conclusion Point out the reality that conventional inductive learning (a single best model, or multiple complicated models) is probably far more complicated than necessary. Propose a very efficient and accurate random tree algorithm.