Presentation on theme: "Wei Fan Ed Greengrass Joe McCloskey Philip S. Yu Kevin Drummey"— Presentation transcript:
1 Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches
Wei Fan, Ed Greengrass, Joe McCloskey, Philip S. Yu, Kevin Drummey
2 Example: Simple Decision Tree Method
Construction: at each node, a feature is chosen at random.
Discrete feature: it can be chosen only if it has not already been chosen on the decision path from the root to the current node, because every example on that path already has the same value for it.
Continuous feature: it can be chosen multiple times on the same decision path; each time, a random threshold value is chosen.
3 Continued
Stop growing when the number of examples in the leaf node is too small or the height of the tree exceeds a limit.
Each node of the tree keeps the number of examples belonging to each class, for example 10 + and 5 -.
Construct at least 10 trees; there is no need for more than 30.
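The construction rules on the two slides above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `build_random_tree`, the dictionary-based node layout, the default values of `max_depth` and `min_leaf`, and drawing the threshold uniformly between the observed minimum and maximum are all assumptions made for the example.

```python
import random
from collections import Counter

def build_random_tree(rows, labels, discrete, continuous, depth=0,
                      max_depth=8, min_leaf=3, used=frozenset(), rng=random):
    """Grow one random tree. Each row is a dict mapping feature name -> value."""
    node = {"counts": Counter(labels)}            # class counts kept at every node
    candidates = [f for f in discrete if f not in used] + list(continuous)
    if len(rows) < min_leaf or depth >= max_depth or not candidates:
        return node                               # stopping rules: small node or deep tree
    feat = rng.choice(candidates)                 # no gain function: purely random pick
    if feat in continuous:
        values = [r[feat] for r in rows]
        thr = rng.uniform(min(values), max(values))   # assumed way to draw the threshold
        key = lambda r: r[feat] <= thr
        next_used = used                          # continuous features may be reused
    else:
        thr = None
        key = lambda r: r[feat]
        next_used = used | {feat}                 # discrete feature: once per path
    groups = {}
    for r, y in zip(rows, labels):
        rs, ys = groups.setdefault(key(r), ([], []))
        rs.append(r)
        ys.append(y)
    if len(groups) < 2:                           # the split separated nothing
        return node
    node["feature"] = feat
    if thr is not None:
        node["threshold"] = thr
    node["children"] = {
        k: build_random_tree(rs, ys, discrete, continuous, depth + 1,
                             max_depth, min_leaf, next_used, rng)
        for k, (rs, ys) in groups.items()
    }
    return node
```

Calling the function 10 to 30 times with the same data yields the ensemble described on the slide; each call makes different random choices, so the trees differ.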
4 Classification
Each tree outputs an estimated posterior probability: a node with 10 + and 5 - outputs P(+|x,t) = 10/15 ≈ 0.67.
Multiple trees average their probability estimates as the final output.
Use the estimated probability and the given loss function to choose the label that minimizes expected loss.
0-1 loss (traditional accuracy): choose the most probable label.
Cost-sensitive loss: choose the label that minimizes risk.
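The averaging step can be illustrated with a short Python sketch. The helper names (`leaf_posterior`, `forest_posterior`, `most_probable`) and the second leaf's counts are hypothetical and chosen only for this example; the first leaf uses the 10 + and 5 - counts from the slide.

```python
from collections import Counter

def leaf_posterior(counts, label):
    """P(label | x, tree) from the class counts stored in the reached leaf."""
    total = sum(counts.values())
    return counts[label] / total if total else 0.0

def forest_posterior(leaf_counts, classes):
    """Average the per-tree estimates; leaf_counts holds one Counter per tree."""
    n = len(leaf_counts)
    return {c: sum(leaf_posterior(k, c) for k in leaf_counts) / n for c in classes}

def most_probable(posterior):
    """Decision under 0-1 loss: pick the label with the highest probability."""
    return max(posterior, key=posterior.get)

# Two trees: the slide's leaf (10 +, 5 -) and a made-up second leaf (3 +, 9 -).
leaves = [Counter({"+": 10, "-": 5}), Counter({"+": 3, "-": 9})]
p = forest_posterior(leaves, ["+", "-"])
print(round(p["+"], 3), most_probable(p))   # prints: 0.458 -
```

Note that the first tree alone would predict +, while the averaged estimate favors -, which is why the ensemble output rather than any single tree is used for the decision.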
5 Difference from Traditional Decision Trees
No gain function: no info gain, Gini index, Kearns-Mansour criterion, or others.
No feature selection: the feature with the highest “gain” is not chosen.
Multiple trees.
Relies on probability estimates.
6 How well does it work?
Credit card fraud detection:
Each transaction has a transaction amount.
There is an overhead of $90 to challenge a fraud.
Predict fraud if and only if P(fraud|x) * $1000 > $90, where P(fraud|x) * $1000 is the expected loss.
When the expected loss is more than the overhead, it is worth challenging the transaction.
Three models: traditional unpruned decision tree, traditional pruned decision tree, and RDT.
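The decision rule for the fraud example is small enough to write out directly. The function name `should_challenge` is hypothetical; the $90 overhead and the $1000 amount come from the slide.

```python
def should_challenge(p_fraud, amount, overhead=90.0):
    """Challenge when the expected loss p_fraud * amount exceeds the overhead."""
    return p_fraud * amount > overhead

print(should_challenge(0.10, 1000.0))   # expected loss $100 is above the $90 overhead
print(should_challenge(0.05, 1000.0))   # expected loss $50 is below the $90 overhead
```

The rule shows why a well-estimated probability matters more than a hard label here: a transaction with only a 10% chance of fraud is still worth challenging.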
8 Randomization
Feature selection randomization: RDT is completely random; Random Forest considers a random subset at each node; etc.
Feature subset randomization: fixed random subset.
Data randomization: bootstrap sample (Bagging and Random Forest); data partitioning.
Feature combination.
9 Methods Included
RDT: chooses a feature randomly, and chooses the threshold for a continuous feature randomly.
RF and RF+ (variation of Random Forest): choose k features randomly, then choose the one among the k with the highest info gain.
Variation I: use the original dataset.
Variation II: output a probability instead of voting.
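The Random Forest style of split selection (sample k features, keep the one with the highest info gain) contrasts with RDT's purely random pick. A minimal sketch follows, restricted to discrete features for brevity; the function names and the row-as-dict layout are assumptions for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    """Information gain of splitting on one discrete feature."""
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[feat], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def pick_split_feature(rows, labels, features, k, rng=random):
    """Sample k candidate features at random, keep the one with the highest gain."""
    candidates = rng.sample(features, min(k, len(features)))
    return max(candidates, key=lambda f: info_gain(rows, labels, f))
```

Setting k = 1 removes the gain comparison entirely, which makes the feature choice as random as in RDT.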
11 Some concepts
True posterior probability P(y|x): the probability that an example belongs to class y, conditioned on its feature vector x. It is generated from some unknown function F.
Given a loss function, the optimal decision y* is the class label that minimizes the expected loss.
0-1 loss: the most probable label. For a binary problem with classes + and -, if P(+|x) = 0.7 and P(-|x) = 0.3, predict +.
Cost-sensitive loss: choose the class label that minimizes expected risk, e.g. predict fraud when P(fraud|x) * $1000 > $90.
The optimal label y* may not always be the true label. For example, under 0-1 loss with P(+|x) = 0.6, the true label is - with probability 0.4.
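The optimal decision can be written compactly. The loss notation L(y', y), meaning the loss incurred by predicting y' when the true label is y, is an assumption introduced here; the slide itself does not name the loss function.

```latex
y^{*} \;=\; \arg\min_{y'} \sum_{y} P(y \mid x)\, L(y', y)
```

Under 0-1 loss, where L(y', y) = 0 if y' = y and 1 otherwise, the sum equals 1 - P(y'|x), so the minimizer is the most probable label, matching the binary example above.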
12 Estimated Probability
We use M to “approximate” the true function F; we almost never know F.
The probability estimated by a model M is written P(y|x,M).
The dependency on M is non-trivial: a decision tree uses the tree structure and the parameters within the structure to approximate P(y|x), while a mixture model uses basis functions such as naïve Bayes and Gaussian.
What is the relation between P(y|x,M) and P(y|x)?
13 Important Observation
If P(y|x,M) = P(y|x), the expected loss for any loss function will be the smallest.
Interesting cases:
P(y|x,M) = P(y|x) and 0-1 loss: 100% accuracy? Yes, but only if the problem is deterministic, i.e. P(y|x) = 1 for the true label and 0 for all others. Otherwise, you can only choose the most likely label, which can still be wrong for some examples.
Can M beat the accuracy of P(y|x), even if P(y|x,M) ≠ P(y|x)? Yes, for some specific example or specific test set, but not in general, i.e. not in terms of expected loss.
14 Reality
Class labels are given; however, P(y|x) is not given in any dataset unless the dataset is synthesized.
Next question: how to set the “true” P(y|x) for a realistic dataset?
15 Choosing P(y|x): Naïve Approach
Assume that P(y|x) is 1 for the true class label of x and 0 for all other class labels. For example, in a two-class problem with + and -, if x’s label is +, assume P(+|x) = 1 and P(-|x) = 0.
This is only true if the problem is deterministic and noise free. It is a rather strong assumption and may cause problems.
Suppose x has true class label +, with M1: P(+|x,M1) = 1, P(-|x,M1) = 0 and M2: P(+|x,M2) = 0.8, P(-|x,M2) = 0.2. Both M1 and M2 predict correctly, but the naïve assumption penalizes M2.
16 Utility-based Choice of P(y|x)
Definition: v is the probability threshold for model M to correctly predict the optimal label y* of x: if P(y*|x,M) > v, predict y*.
Assume y* to be the true class label of an example.
Example, binary class with 0-1 loss: v = 0.5, i.e. if P(y|x,M) > 0.5, predict y.
Example, credit card fraud with cost-sensitive loss: P(y|x,M) * $1000 > $90, so v = 90/1000 = 0.09.
In summary, we use [v, 1] as the range of the true probability P(y|x). This is weaker than assuming P(y|x) = 1 for the true class label.
17 Example
Two-class problem:
Naïve assumption: P(y|x) = 1 for the correct label and 0 for all others.
We assume P(y|x) ∈ (0.5, 1], which includes the “naïve assumption” P(y|x) = 1.
We redefine some measurements to fix the problem of “penalty”.
18 Desiderata
If P(y|x,M) ∈ [v, 1], the exact value does not matter, since we already predict the true label.
When P(y|x,M) < v(x,M), the difference is important: it measures how far off we are from making the right decision.
The measure should take the loss function into account, since the goal is to minimize its expected value.
19 Evaluating P(y|x,M)
Improved MSE: a squared error defined with the clipping operator [[a]] = min(a, 1).
Cross-entropy: undefined when either P(y|x,M) = 0 or the true probability P(y|x) = 0, and it has no relation to the loss function.
Reliability plots: previously proposed and used, e.g. by Zadrozny and Elkan ’02 (explained later).
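The transcript preserves only the clipping operator [[a]] = min(a, 1), not the squared-error formula itself. The sketch below is therefore an assumed form, not the definition from the slide: it is one simple expression that satisfies the stated desiderata (zero error once the estimate for the true label reaches the threshold v, growing error as it falls below). The names `clip1` and `thresholded_squared_error` are hypothetical.

```python
def clip1(a):
    """The slide's [[a]] operator: min(a, 1)."""
    return min(a, 1.0)

def thresholded_squared_error(p_true_label, v):
    """Assumed form: 0 once P(y|x,M) >= v, approaching 1 as P(y|x,M) -> 0."""
    return (1.0 - clip1(p_true_label / v)) ** 2

# With v = 0.5 (binary 0-1 loss), estimates of 0.8 and 1.0 are not penalized
# differently, unlike under the naive assumption P(y|x) = 1.
print(thresholded_squared_error(0.8, 0.5), thresholded_squared_error(1.0, 0.5))
print(thresholded_squared_error(0.25, 0.5))
```

Because v depends on the loss function (0.5 for binary 0-1 loss, 0.09 in the fraud example), a measure of this shape ties the evaluation to the decision being made.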
20 Synthetic Dataset
The true probability P(y|x) is known and can be used to measure the exact MSE.
Standard bias and variance decomposition of MSE.
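For reference, the standard decomposition at a fixed x and y can be written as follows. The notation M_D, for the model trained on a randomly drawn training set D, is an assumption introduced here; the expectation is over D (and over the algorithm's internal randomness).

```latex
\mathbb{E}_{D}\!\left[\big(P(y \mid x, M_D) - P(y \mid x)\big)^{2}\right]
= \underbrace{\big(\mathbb{E}_{D}[P(y \mid x, M_D)] - P(y \mid x)\big)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}_{D}\!\left[\big(P(y \mid x, M_D) - \mathbb{E}_{D}[P(y \mid x, M_D)]\big)^{2}\right]}_{\text{variance}}
```

No noise term appears because the target here is the known true probability P(y|x) rather than a noisy observed label.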
22 Binary Dataset
Donation dataset: send a letter to solicit a donation. It costs 68 cents to send a letter.
Cost-sensitive loss: send a letter when P(donate|x) * amt(x) > 68 cents.
MLR was used to estimate amt(x). Better results could be obtained with Heckman’s two-step procedure (Zadrozny and Elkan ’02).
24 Reliability Plot
Divide the score or output probability into bins, either of equal size (such as 10 or 100 bins) or with an equal number of examples.
For the examples in the same bin: average their predicted probabilities and call it bin_x; divide the number of examples with label y by the total number of examples in the bin and call it bin_y.
Plot (bin_x, bin_y).
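The binning procedure translates directly into code. The sketch below uses the equal-size-bin variant and returns the points rather than drawing them; the function name `reliability_points` and its argument names are hypothetical.

```python
def reliability_points(probs, is_class_y, n_bins=10):
    """Return (bin_x, bin_y) pairs using equal-width bins over [0, 1].

    probs      -- predicted probabilities P(y|x,M), one per example
    is_class_y -- booleans, True when the example's label is y
    """
    bins = [[] for _ in range(n_bins)]
    for p, hit in zip(probs, is_class_y):
        idx = min(int(p * n_bins), n_bins - 1)    # p == 1.0 falls in the last bin
        bins[idx].append((p, hit))
    points = []
    for members in bins:
        if not members:
            continue                               # empty bins produce no point
        bin_x = sum(p for p, _ in members) / len(members)           # mean prediction
        bin_y = sum(1 for _, hit in members if hit) / len(members)  # empirical rate
        points.append((bin_x, bin_y))
    return points
```

A well-calibrated model produces points close to the diagonal bin_y = bin_x; the returned list can be passed to any plotting library.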
27 Multi-Class Dataset
Artificial Character dataset from UCI. Class labels: 10 letters.
Three loss functions:
Top 1: the true label is the most probable letter.
Top 2: the true label is among the two most probable letters.
Top 3: the true label is among the three most probable letters.
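The three loss functions differ only in how many of the top-ranked labels count as a hit, which a short sketch makes concrete. The function names and the dictionary representation of the posterior are assumptions for illustration.

```python
def top_k_correct(posterior, true_label, k):
    """True if true_label is among the k most probable labels."""
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    return true_label in ranked[:k]

def top_k_accuracy(posteriors, true_labels, k):
    """Fraction of examples whose true label is in the top k."""
    hits = sum(top_k_correct(p, y, k) for p, y in zip(posteriors, true_labels))
    return hits / len(true_labels)

# One example with a made-up posterior over three letters.
p = {"A": 0.5, "B": 0.3, "C": 0.2}
print(top_k_correct(p, "B", 1), top_k_correct(p, "B", 2))   # prints: False True
```

Top 2 and Top 3 reward a model whose probability ranking is good even when its single most probable label is wrong, so they probe the quality of the estimates beyond plain accuracy.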
31 What we learned
On studies of probability approximation: assuming P(y|x) = 1 is a very strong assumption and causes problems. We suggested a relaxed choice of P(y|x) and an improved definition of MSE that takes the loss function into account.
Methodology part: proposed a variation of Random Forest.
32 Summary of Experiments
Various experiments: a synthetic dataset with true probability P(y|x), and binary and multi-class problems.
Reliability plots and MSE show that randomized approaches approximate P(y|x) significantly more closely.
Bias and variance decomposition of the probability estimates, as compared to that of the loss function: the reduction comes mainly from variance, and bias is reduced as well.
33 What next
We traditionally think that estimating probabilities is a harder problem than predicting class labels:
Simplified approach: naïve Bayes, which relies on the uncorrelated-features assumption.
Finite mixture models: still based on an assumed basis function.
Logistic regression: sensitive to example layout, and its handling of categorical features is subjective.
Bayes network: needs knowledge about causal relations; finding the optimal one is NP-hard.
34 Continued
We show that rather simple randomized approaches approximate probability very well.
Next step: is it time to redesign simpler algorithms that approximate probability better?