BOF Trees Visualization Zagreb, June 12, 2004

“BOF” Trees Diagram as a Visual Way to Improve Interpretability of Tree Ensembles

Vesna Luzar-Stiffler, Ph.D., University Computing Centre and CAIR Research Centre, Zagreb, Croatia
Charles Stiffler, Ph.D., CAIR Research Centre, Zagreb, Croatia

Outline
- Introduction/Background
- Trees
- Ensemble Trees
- Visualization Tools
- Simulation Results
- Web Survey Results
- Conclusions/Recommendations

Introduction / Background
Classification / Decision Trees
- Data mining (statistical learning) method for classification
- Invented twice:
  - Statistical community: Breiman, Friedman et al. (1984)
  - Machine Learning community: Quinlan (1986)
- Many positive features: interpretability, ability to handle data of mixed type and missing values, robustness to outliers, etc.
- Disadvantages:
  - Unstable vis-à-vis seemingly minor data perturbations
  - Low predictive power

Introduction / Background
Possible improvements: Ensembles
- Bagging, i.e., bootstrapping trees (Breiman, 1996)
- Boosting, e.g., AdaBoost (Freund & Schapire, 1997)
- Random Forests (Breiman, 2001)
- Stacking, randomized trees, etc.
Advantage: improved prediction
Disadvantage: loss of interpretability (“black box”)

Classification Tree
Let $\hat{f}(x)$ be the classification tree prediction at input $x$ obtained from the full “training” data $Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.

Bagging Classification Tree
Let $\hat{f}^{*b}(x)$ be the classification tree prediction at input $x$ obtained from the bootstrap sample $Z^{*b}$, $b = 1, 2, \ldots, B$.
Bagging estimate (average over the $B$ bootstrap trees):
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
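The bagging idea on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a one-split tree (a "stump") stands in for a full classification tree, the ensemble prediction is taken as a majority vote, and the names `fit_stump`, `predict_stump`, `bagged_fit`, and `bagged_predict` are hypothetical helpers, not part of any package used in the talk.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split tree by minimizing training misclassifications."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            cl = Counter(left).most_common(1)[0][0]   # majority class, left
            cr = Counter(right).most_common(1)[0][0]  # majority class, right
            err = sum(yi != cl for yi in left) + sum(yi != cr for yi in right)
            if best is None or err < best[0]:
                best = (err, j, t, cl, cr)
    if best is None:
        # No valid split (all rows identical): always predict the majority class.
        c = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), c, c)
    return best[1:]

def predict_stump(stump, x):
    j, t, cl, cr = stump
    return cl if x[j] <= t else cr

def bagged_fit(X, y, B, rng):
    """Fit B stumps, each on a bootstrap sample Z*b drawn with replacement."""
    n = len(X)
    stumps = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def bagged_predict(stumps, x):
    """Majority vote over the B bootstrap predictions (an assumed aggregation rule)."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]
```

A call such as `bagged_fit(X, y, B=25, rng=random.Random(0))` returns the list of fitted stumps, which `bagged_predict` then aggregates for a new input.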

Visualization tools
Graphs based on the predictor “importances” matrix $F$ of size $B \times p$ ($p$ = number of predictors)
- For bagged trees, we take the average over the $B$ trees: $\bar{F}_j = \frac{1}{B} \sum_{b=1}^{B} F_{bj}$, $j = 1, \ldots, p$
- Diagram 1 is the importance mean bar chart
- Diagram 2 (“BOF Clusters”) is the cluster means chart (NEW)
- Diagram 3 (“BOF MDPREF”) is the multidimensional preference bi-plot (NEW)
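The averaging step behind Diagram 1 can be illustrated as follows. The sketch assumes the importance matrix F is available as a list of B rows (one per bootstrap tree) with p entries each; `mean_importance` and `text_bar_chart` are hypothetical helper names, and the text rendering merely stands in for a real bar chart.

```python
def mean_importance(F):
    """Column means of a B x p importance matrix F (one row per bootstrap tree)."""
    B = len(F)
    p = len(F[0])
    return [sum(row[j] for row in F) / B for j in range(p)]

def text_bar_chart(means, names, width=40):
    """Render mean importances as text bars, largest first."""
    top = max(means) or 1.0  # avoid division by zero if all means are 0
    lines = []
    for name, m in sorted(zip(names, means), key=lambda t: -t[1]):
        lines.append(f"{name:>4} {'#' * round(width * m / top)} {m:.3f}")
    return lines
```

For example, `text_bar_chart(mean_importance(F), ["x1", "x2", "x3", "x4", "x5"])` lists the predictors in decreasing order of mean importance.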

Visualization tools
Graphs based on the proximity matrix $P$ of size $n \times n$ ($n$ = number of cases)
- Diagram 4 (“Proximity Clusters”) is the cluster means chart (Breiman, 2002)
- Diagram 5 (“Proximity MDS”) is the multidimensional scaling plot of “similar” cases (Breiman, 2002)
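The slide does not spell out how P is computed. A minimal sketch, assuming the usual random-forest convention that the proximity of two cases is the fraction of trees in which they land in the same terminal node, could look like this. The input layout `leaves[b][i]` (terminal-node id of case i in tree b) and the function name `proximity_matrix` are assumptions made for illustration.

```python
def proximity_matrix(leaves):
    """Build the n x n proximity matrix from per-tree terminal-node ids.

    leaves[b][i] is the terminal-node id of case i in tree b (assumed layout).
    P[i][j] is the fraction of trees in which cases i and j share a node.
    """
    B = len(leaves)
    n = len(leaves[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaves:
        for i in range(n):
            for j in range(n):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / B for c in row] for row in counts]
```

The resulting matrix is symmetric with ones on the diagonal, and `1 - P[i][j]` can serve as a dissimilarity for clustering or multidimensional scaling.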

Simulation experiments
- S1: Generate a sample of size n = 30 with two classes and p = 5 variables (x_1 to x_5), standard normal and pair-wise correlated. The responses are generated according to Pr(Y=1 | x_1 ≤ 0.5) = 0.2, Pr(Y=1 | x_1 > 0.5) = 0.8.
- S2: Generate a sample of size n = 30 with two classes and p = 5 variables (x_1 to x_5), standard normal, with pair-wise correlation 0.95 between x_1 and x_2 and 0 among the other predictors. The responses are generated according to Pr(Y=1 | x_1 ≤ 0.5) = 0.2, Pr(Y=1 | x_1 > 0.5) = 0.8.
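As an illustration, the S2 design can be generated with the standard library alone. The sketch covers only S2, because the correlation value for S1 is not given in the text. The function name `simulate_s2` and the `seed` argument are assumptions, and the correlation is induced by mixing two independent normals, which is one common construction rather than necessarily the one the authors used.

```python
import math
import random

def simulate_s2(n=30, p=5, rho=0.95, seed=0):
    """Draw one S2 sample: corr(x1, x2) = rho, other predictors independent."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = list(z)
        # x2 = rho*x1 + sqrt(1 - rho^2)*z2 has unit variance and corr(x1, x2) = rho
        x[1] = rho * z[0] + math.sqrt(1.0 - rho ** 2) * z[1]
        # Pr(Y=1 | x1 <= 0.5) = 0.2, Pr(Y=1 | x1 > 0.5) = 0.8
        prob = 0.2 if x[0] <= 0.5 else 0.8
        y.append(1 if rng.random() < prob else 0)
        X.append(x)
    return X, y
```

Calling `simulate_s2(seed=b)` for several values of `b` gives independent replicates of the experiment.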

Diagram 1, Mean importance (panels: S1, S2)

Diagram 2, “BOF Clusters” (panels: S1, S2)

Diagram 3, “BOF MDPREF” (panels: S1, S2)

Diagram 4, “Proximity Clusters” (panels: S1, S2)

Web Survey data
- ICT infrastructure/usage in Croatian primary and secondary schools
- 25,000+ teachers (cases)
- 200+ variables
- Response: “classroom use of a computer by educators” (yes/no)
- Partition:
  - 50% training
  - 25% validation
  - 25% test
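A 50/25/25 partition of this kind can be sketched as a random split of case indices. The function name `partition_indices`, the `seed` argument, and the use of a simple shuffle (rather than, say, a stratified split) are assumptions; the slide does not say how the partition was drawn.

```python
import random

def partition_indices(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly split indices 0..n-1 into training, validation, and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

With `n = 25000`, for instance, this yields 12,500 training, 6,250 validation, and 6,250 test cases.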

Initial tree (before bagging)

Diagram 1, “Mean importance”

Diagram 2, “BOF Clusters”

Diagram 3, “BOF MDPREF”

Bootstrap tree 11

Bootstrap tree 22

Bootstrap tree 12

Clustering trees

Diagram 5, “Proximity MDS”

Conclusions / Recommendations
- There are software packages for trees
- There are some software packages for tree ensembles
- There are some visualization tools (old and new)
- The problem is that they are not “interfaced” (integrated)