Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation
Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu
Dec. 21, 2006
What this Paper Offers
- Preference of a probability estimation tree (PET)
- Many important and previously unrevealed properties of PETs
- A practical guide for choosing the most appropriate PET algorithm
Why Probability Estimation? - Theoretical Necessity
Statistical Supervised Learning
- Unknown distribution P(X,Y), y = F(x)
- Learning algorithm: from LS = (definition not preserved in the transcript) to a model f(x,θ)
- A loss/cost function L
Bayesian Decision Rule
- Stated for 0-1 loss and for cost-sensitive loss (the slide's formulas are not preserved in the transcript)
- Challenge: P(x,y) is unknown, and so is P(y|x)!
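For reference, the textbook form of the Bayes decision rule can be written as follows. This is a sketch of the standard statement, not a recovery of the slide's own formulas; the symbols y', L(y', y), and the cost matrix C are notation assumed here.

```latex
% Bayes decision rule for a loss L(y', y): predict y' when the truth is y
y^{*}(x) = \arg\min_{y'} \sum_{y} P(y \mid x)\, L(y', y)

% 0-1 loss: L(y', y) = 0 if y' = y, and 1 otherwise
y^{*}(x) = \arg\max_{y} P(y \mid x)

% Cost-sensitive loss: L(y', y) = C(y', y), an entry of a cost matrix (assumed notation)
y^{*}(x) = \arg\min_{y'} \sum_{y} P(y \mid x)\, C(y', y)
```

In both cases the rule needs the posterior P(y|x), which is why estimating it is a theoretical necessity.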
Why Probability Estimation? - Practical Necessity
- Medical domain
- Ozone level prediction
- Direct marketing
- Non-static, skewed distribution
- Unequal loss (Yang, ICDM05)
- Direct estimation of probability
- Decision threshold determination
Posterior Probability Estimation
Parametric Methods
- The true and unknown distribution follows a particular form
- Estimated via maximum likelihood
- E.g. Naïve Bayes, logistic regression
Non-Parametric Approaches
- Directly calculated without making any assumption
- E.g. decision trees, nearest neighbors
- A rather unbiased, flexible and convenient solution
PETs - Probabilistic View of Decision Trees (e.g. C4.5, CART)
- Confidences in the predicted labels
- Appropriate thresholding for classification w.r.t. different loss functions
- The dependence of P(y|x,θ) on θ is non-trivial
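A minimal sketch of thresholding a leaf's probability estimate under unequal losses. It assumes a binary problem with zero cost for correct decisions; the function names and the cost parameters `cost_fp` and `cost_fn` are illustrative and not taken from the slides.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability at or above which predicting positive has the lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation; predicting
    negative costs p * cost_fn. Positive wins when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def classify(p_positive, cost_fp=1.0, cost_fn=1.0):
    """Turn a probability estimate into a label for the given (assumed) costs."""
    return 1 if p_positive >= cost_sensitive_threshold(cost_fp, cost_fn) else 0


leaf_estimate = 0.3
print(classify(leaf_estimate))                # equal costs, threshold 0.5
print(classify(leaf_estimate, cost_fn=9.0))   # a miss costs 9x more, threshold 0.1
```

The same estimate of 0.3 yields different labels under the two loss settings, which is only possible when the tree outputs probabilities rather than hard labels.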
Problems of Traditional PETs
1. Probability estimates through frequency tend to be too close to the extremes of 1 and 0.
2. Additional inaccuracies result from the small number of examples within a leaf.
3. The same probability is assigned to the entire region of space defined by a given leaf.
Algorithms addressing these problems: C4.4 (Provost,03), CFT (Ling,03), BPET (Breiman,96), RDT (Fan,03)
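The first two problems can be seen by comparing the raw frequency estimate at a leaf with the Laplace correction. The sketch below is illustrative only; the function names and the leaf counts are made up.

```python
def frequency_estimate(k, n):
    """Raw leaf estimate: k examples of the class out of n examples in the leaf."""
    return k / n


def laplace_estimate(k, n, n_classes=2):
    """Laplace-corrected estimate: shrinks small-leaf estimates toward 1 / n_classes."""
    return (k + 1) / (n + n_classes)


# A pure leaf with only 3 examples versus a pure leaf with 30 examples.
for k, n in [(3, 3), (30, 30)]:
    print(n, frequency_estimate(k, n), laplace_estimate(k, n))
```

Both leaves get a frequency estimate of exactly 1.0, while the Laplace correction gives 0.8 for the 3-example leaf and 0.96875 for the 30-example leaf, so the estimate reflects how much evidence the leaf holds.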
Popular PET Algorithms
PET Algorithm | Single or Multiple Model(s) | Feature Selection Criterion | Probability Estimation Method | Pruning Strategy | Diversity Acquisition
C4.5 (Quinlan,93) | Single | Gain Ratio | Frequency Estimation | Error-based Pruning | N/A
C4.4 (Provost,03) | Single | Gain Ratio | Laplace Correction | No | N/A
CFT (Ling,03) | Single | Gain Ratio | Aggregation over Leaves (FE/LC at leaf node) | No or Error-based Pruning | N/A
RDT (Fan,03) | Multiple | Randomly Chosen | Model Averaging | No or Depth Constraint | Random Manipulation of feature set
BPET (Breiman,96) | Multiple | Gain Ratio | Model Averaging | No | Random Manipulation of training set
Which one to choose? What performance should be expected? Why should one PET be preferred over another?
Contributions
- A large-scale learning curve study using multiple evaluation metrics
- Preference of a PET: signal-noise separability of datasets
- Many important and previously unrevealed properties of PETs: among ensembles, RDT is preferable on low-signal separability datasets, while BPET is favorable when the signal separability is high
- A practical guide for choosing the most appropriate PET algorithm
Analytical Tool # 1: AUC - Index of Signal Noise Separability
Signal-noise separability
- Correct identification of the information of interest in the presence of noise factors that may interfere with this identification
- A good analogy for the two different populations present in every learning domain with uncertainty
A synthetic scenario: tumor diagnosis
- Tumor: signal present
- No tumor: signal absent
- Based on a yes/no decision:
1. P(yes|tumor): hit (TP)
2. P(yes|no tumor): false alarm (FP)
3. P(no|tumor): miss (FN)
4. P(no|no tumor): correct reject (TN)
Analytical Tool # 1: AUC - Index of Signal Noise Separability
An Illustration
[Figure: densities f(x|signal), f(x|noise 1), and f(x|noise 2) over x from -8 to 8 (density from 0 to 0.25); a decision criterion divides the signal and noise areas into hit, miss, false alarm, and correct reject.]
- The relative areas of the four outcomes vary with the decision criterion; the separation of the two distributions does not!
- AUC: an index for the separability of signal from noise
- Domains: high/low degree of signal separability
- High: deterministic / little noise
- Low: stochastic / noisy
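A small sketch of why AUC indexes separability rather than the decision criterion. It assumes, purely for illustration, that noise follows N(0, 1) and signal follows N(separation, 1), and that the answer is "yes" when x exceeds the criterion; under that equal-variance assumption the AUC has the closed form Phi(separation / sqrt(2)). The function names are hypothetical.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def outcome_rates(criterion, separation):
    """Rates of the four outcomes when 'yes' is answered for x > criterion."""
    hit = 1.0 - phi(criterion - separation)   # P(yes | signal)
    false_alarm = 1.0 - phi(criterion)        # P(yes | noise)
    return {"hit": hit, "miss": 1.0 - hit,
            "false_alarm": false_alarm, "correct_reject": 1.0 - false_alarm}


def binormal_auc(separation):
    """AUC for two unit-variance Gaussians whose means differ by `separation`."""
    return phi(separation / math.sqrt(2.0))


for criterion in (0.0, 1.0, 2.0):
    print(criterion, outcome_rates(criterion, separation=2.0))
print(binormal_auc(2.0))
```

Moving the criterion changes all four rates, but `binormal_auc` takes no criterion argument at all: it depends only on how far apart the two populations are.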
Analytical Tool # 2: Learning Curves
- Used instead of CV or training-test splitting based on a fixed data set size
- Generalization performance of different models as a function of the size of the training set
- Correlation between performance metrics and training set sizes can be observed and possibly generalized over different data sets
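A learning curve can be sketched as: for each training set size, repeatedly sample that many examples, fit a model, score it on a fixed test set, and average. Everything below is an illustrative assumption rather than the study's setup: the toy learner (a Laplace-corrected frequency table over one categorical feature), the synthetic data, the sizes, and the number of repeats.

```python
import random


def fit_frequency_table(train):
    """Toy learner: Laplace-corrected P(y=1 | x) for a single categorical feature."""
    counts = {}
    for x, y in train:
        pos, total = counts.get(x, (0, 0))
        counts[x] = (pos + y, total + 1)

    def predict(x):
        pos, total = counts.get(x, (0, 0))
        return (pos + 1) / (total + 2)

    return predict


def brier(predict, test):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)


def learning_curve(train_pool, test, sizes, repeats=20, seed=0):
    """Average test score for each training set size, over random subsamples."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        scores = [brier(fit_frequency_table(rng.sample(train_pool, n)), test)
                  for _ in range(repeats)]
        curve.append((n, sum(scores) / repeats))
    return curve


def make_data(n, rng):
    """Synthetic data: x in {0,1,2,3}, with P(y=1 | x) fixed per value."""
    p_pos = [0.1, 0.3, 0.7, 0.9]
    data = []
    for _ in range(n):
        x = rng.randrange(4)
        data.append((x, 1 if rng.random() < p_pos[x] else 0))
    return data


rng = random.Random(1)
pool, test = make_data(400, rng), make_data(400, rng)
curve = learning_curve(pool, test, sizes=[10, 50, 200])
print(curve)
```

Swapping `brier` for AUC or error rate gives one curve per metric, which is how several metrics can be compared across training set sizes.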
Analytical Tool # 3: Multiple Evaluation Metrics
1. Area Under ROC Curve (AUC)
- Summarizes the ranking capability of a learning algorithm in ROC space
2. MSE (Brier Score)
- A proper assessment of the accuracy of probability estimation
- Calibration-refinement decomposition
  * Calibration measures the absolute precision of probability estimation
  * Refinement indicates how confident the estimator is in its estimates
  * Visualization tools: reliability plots and sharpness graphs
3. Error Rate
- Inappropriate criterion for evaluating probability estimates
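The calibration-refinement decomposition can be sketched for the binary case by grouping test points that share the same predicted probability. With that grouping the two terms add up exactly to the Brier score. The function names and the toy predictions are assumptions for illustration.

```python
from collections import defaultdict


def brier_score(probs, labels):
    """Mean squared difference between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def calibration_refinement(probs, labels):
    """Split the Brier score into calibration and refinement terms.

    Points are grouped by their predicted value p. Within a group of size n_k
    with observed positive rate o_k:
      calibration adds n_k * (p - o_k)^2
      refinement  adds n_k * o_k * (1 - o_k)
    """
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    n = len(labels)
    calibration = refinement = 0.0
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)
        calibration += len(ys) * (p - observed) ** 2
        refinement += len(ys) * observed * (1.0 - observed)
    return calibration / n, refinement / n


probs = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
labels = [1, 1, 1, 0, 0, 1]
cal, ref = calibration_refinement(probs, labels)
print(brier_score(probs, labels), cal, ref)
```

A tree that emits only a few distinct leaf probabilities has few groups, so the decomposition is cheap to compute and lines up directly with a reliability plot.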
Conjectures in Summary
1. RDT and CFT are better on AUC.
2. RDT is preferable on low-signal separability datasets, while BPET is favorable on high-signal separability datasets.
3. High separability categorical datasets with limited feature values hurt RDT.
4. Among single trees, CFT is preferable on low-signal separability datasets.
Behind the Scenes - Why are RDT and CFT better on AUC?
Superior capability for generating unique probabilities
Unique Probabilities (Win-Loss-Tie). Column headers: RDT, C4.4, C4.5, CFT. Cell alignment was lost in extraction, so each row's entries are listed in their original order.
AVG
- BagPET: 0-14.9-3.1, 18-0-0, 11.6-4-2.4
- RDT: 18-0-0, 16-0.6-1.4
- C4.4: 17.9-0-0.1, 0.1-15.2-2.7
- C4.5: 0-17.3-0.7
STDEV
- BagPET: 0-0.9-0.9, 0-0-0, 1.9-1.8-0.5
- RDT: 0-0-0, 0.8-0.5-0.7
- C4.4: 0.3-0-0.3, 0.3-1.6-1.3
- C4.5: 0-1.9-1.9
AUC calculation: trapezoidal integration (Fawcett,03) (Hand,01)
- For a larger AUC, P(y|x,θ) should vary from one test point to another
- The number of unique probabilities is maximized as a result
- RDT > BPET > CFT > C4.4 > C4.5
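A sketch of trapezoidal AUC that shows how tied probability estimates can cost ranking performance. The ROC curve is traced one distinct score at a time, so tied examples produce a diagonal segment. The labels and both score vectors are invented for illustration, and the function name is hypothetical.

```python
def roc_auc_trapezoid(labels, scores):
    """Area under the ROC curve by trapezoidal integration, handling ties.

    Thresholds are the distinct scores in descending order; all examples
    sharing a score enter the curve together.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    for s in sorted(set(scores), reverse=True):
        tp += sum(1 for y, p in zip(labels, scores) if p == s and y == 1)
        fp += sum(1 for y, p in zip(labels, scores) if p == s and y == 0)
        tpr, fpr = tp / n_pos, fp / n_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


labels = [1, 1, 0, 1, 0, 0]
fine = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]     # six unique probabilities
coarse = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]   # two unique probabilities

print(len(set(fine)), roc_auc_trapezoid(labels, fine))
print(len(set(coarse)), roc_auc_trapezoid(labels, coarse))
```

In this toy case the coarse scores tie one negative with all three positives, and the AUC drops from 8/9 to 5/6. Fewer unique probabilities do not always lower AUC, but they remove the estimator's ability to break such ties.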
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
The reasons:
1. RDT discards any criterion for optimal feature selection.
2. It is more like a structure for data summarization.
3. When the signal separability is low, this property protects RDT from the danger of identifying noise as signal or overfitting on noise, which is very likely to be caused by the massive searches or optimization adopted by BPET.
4. RDT provides an average of probability estimates, which approaches the mean of the true probabilistic values as more individual trees are added.
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
The evidence (I) - Spect and Sonar, low-signal separability domains
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
The evidence (II) - Pima, a low-signal separability domain
[Figure panels: RDT, BPET]
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
The evidence (III) - Spam, a high-signal separability domain
[Figure panels: RDT, BPET]
Behind the Scenes - Why do high separability categorical datasets with limited feature values hurt RDT?
The observations - Tic-tac-toe and Chess
Behind the Scenes - Why do high separability categorical datasets with limited feature values hurt RDT?
The reason: high separability categorical datasets with limited values tend to restrict the degree of diversity that RDT's random feature selection can explore.
Random feature selection mechanism of RDT:
- Categorical features: used once along a path
- Continuous features: can be used multiple times, but with a different splitting value each time
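The mechanism can be sketched for categorical features only. This is a simplified illustration, not the RDT implementation from (Fan,03): the tree structure is drawn at random without looking at labels, each categorical feature is used at most once along a path, training examples only fill leaf counts, and predictions average the per-tree leaf frequencies. All names and the toy data are assumptions.

```python
import random


def build_random_tree(features, values, depth, rng):
    """Pick split features at random; a categorical feature is used once per path."""
    if depth == 0 or not features:
        return {"counts": [0, 0]}            # leaf: [negatives, positives]
    f = rng.choice(features)
    rest = [g for g in features if g != f]
    return {"feature": f,
            "children": {v: build_random_tree(rest, values, depth - 1, rng)
                         for v in values[f]}}


def find_leaf(tree, x):
    """Follow the example's feature values down to a leaf."""
    while "feature" in tree:
        tree = tree["children"][x[tree["feature"]]]
    return tree


def train_rdt(data, values, n_trees=10, depth=2, seed=0):
    """Build random structures, then fill leaf class counts from the data."""
    rng = random.Random(seed)
    features = sorted(values)
    forest = [build_random_tree(features, values, depth, rng) for _ in range(n_trees)]
    for tree in forest:
        for x, y in data:
            find_leaf(tree, x)["counts"][y] += 1
    return forest


def predict_proba(forest, x):
    """Average leaf frequencies over trees whose leaf is non-empty (0.5 if none)."""
    estimates = []
    for tree in forest:
        neg, pos = find_leaf(tree, x)["counts"]
        if neg + pos > 0:
            estimates.append(pos / (neg + pos))
    return sum(estimates) / len(estimates) if estimates else 0.5


values = {"a": [0, 1], "b": [0, 1]}
data = [({"a": 0, "b": 0}, 1), ({"a": 0, "b": 0}, 1),
        ({"a": 0, "b": 0}, 0), ({"a": 1, "b": 1}, 0)]
forest = train_rdt(data, values, n_trees=5, depth=2)
print(predict_proba(forest, {"a": 0, "b": 0}))
```

With only two binary features and depth 2, every random tree ends up with the same four cells regardless of split order, so averaging over trees adds nothing. That is a small-scale picture of how a limited set of categorical values restricts the diversity random selection can reach.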
Behind the Scenes - Why is CFT preferable on low-signal separability datasets?
The reasons:
1. Low-signal separability domains
- Good performance benefits from the probability aggregation mechanism
- Aggregation rectifies errors introduced into the probability estimates by attribute noise
2. High-signal separability domains
- Aggregation of the estimated probabilities from other, irrelevant leaves adversely affects the final probability estimates
Behind the Scenes - Why is CFT preferable on low-signal separability datasets?
The evidence (I) - Spect and Pima, low-signal separability domains
The evidence (II) - Liver, a low-signal separability domain
[Figure panels: CFT, C4.4]
Choosing the Appropriate PET Algorithm Given a New Problem
1. Given a dataset, estimate its signal-noise separability through the AUC score of RDT or BPET.
2. AUC < 0.9: low signal-noise separability. Ensemble or single trees?
- Ensemble (AUC, MSE, ErrorRate): RDT
- Single trees (AUC, MSE, ErrorRate): CFT
3. AUC >= 0.9: high signal-noise separability. Ensemble or single tree?
- Ensemble: depends on feature types and value characteristics
  * Categorical features (with limited values): BPET
  * Continuous features (or categorical features with a large number of values): RDT (BPET), for AUC, MSE, ErrorRate
- Single tree: CFT for AUC; C4.5 or C4.4 for MSE and ErrorRate
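One way to read the selection guide is as a small decision function. The sketch below encodes one interpretation of the flowchart, whose layout is flattened in the transcript; the function name, parameter names, and return strings are all hypothetical, and the 0.9 cut-off is the only numeric value taken from the slide.

```python
def choose_pet(separability_auc, want_ensemble, metric="AUC", limited_categorical=False):
    """Suggest a PET algorithm under one reading of the slide's flowchart.

    separability_auc:    AUC obtained by RDT or BPET on the dataset
    want_ensemble:       True for an ensemble, False for a single tree
    metric:              "AUC", "MSE", or "ErrorRate" (matters for single trees
                         on high-separability data)
    limited_categorical: True if features are categorical with few values
    """
    high_separability = separability_auc >= 0.9
    if not high_separability:
        return "RDT" if want_ensemble else "CFT"
    if want_ensemble:
        return "BPET" if limited_categorical else "RDT (or BPET)"
    return "CFT" if metric == "AUC" else "C4.5 or C4.4"


print(choose_pet(0.75, want_ensemble=True))
print(choose_pet(0.95, want_ensemble=True, limited_categorical=True))
print(choose_pet(0.95, want_ensemble=False, metric="MSE"))
```

The branches follow the conjectures stated earlier in the deck: RDT and CFT for noisy domains, BPET when limited categorical values would starve RDT of diversity.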
Summary
- AUC: index of signal-noise separability
- Preference of a PET on multiple evaluation metrics depends on the signal-noise separability of the dataset and other observable statistics
- Many important and unrevealed properties of PETs are analyzed
- A practical guide for choosing the most appropriate PET algorithm