# Decision trees for hierarchical multilabel classification: A case study in functional genomics


Work by

- Hendrik Blockeel, Leander Schietgat, Jan Struyf: Katholieke Universiteit Leuven (Belgium)
- Amanda Clare: University of Aberystwyth (Wales)
- Sašo Džeroski: Jozef Stefan Institute, Ljubljana (Slovenia)

Overview

- Hierarchical Multilabel Classification
  - task description
- Predictive Clustering Trees for HMC
  - the algorithm: Clus-HMC
- Evaluation on yeast datasets

Hierarchical multilabel classification (HMC)

- Classification
  - predict the class of unseen instances based on (classified) training examples
- HMC
  - an instance can belong to multiple classes
  - classes are organised in a hierarchy
- Advantages
  - efficiency
  - skewed class distributions
  - hierarchical relationships
- Example toy hierarchy (position of each class in the label vector in parentheses)
  - 1 (1)
  - 2 (2)
    - 2/1 (3)
    - 2/2 (4)
  - 3 (5)

Predictive clustering trees

- ~ decision trees [Blockeel et al. 1998]
  - each node (including leaves) is a cluster
  - tests in nodes are descriptions of clusters
- Heuristic
  - minimise intra-cluster variance
  - maximise inter-cluster variance
- Can be extended to perform HMC
  - distance measure d (quantifies similarity)
  - prediction function p (maps a cluster in a leaf onto a prediction)
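The variance heuristic can be illustrated with a minimal Python sketch. It assumes that the variance of a cluster is the mean squared Euclidean distance of its vectors to the cluster mean, and that a candidate test is scored by the size-weighted variance reduction it achieves; the function names are hypothetical and this is not the Clus implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def variance(vectors: List[Vector]) -> float:
    """Mean squared Euclidean distance to the cluster mean (assumed definition)."""
    if not vectors:
        return 0.0
    centre = mean_vector(vectors)
    total = sum(
        sum((a - m) ** 2 for a, m in zip(vector, centre)) for vector in vectors
    )
    return total / len(vectors)


def variance_reduction(parent: List[Vector], left: List[Vector], right: List[Vector]) -> float:
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    return (
        variance(parent)
        - len(left) / n * variance(left)
        - len(right) / n * variance(right)
    )


# Hypothetical example: a test that separates two pure groups perfectly.
parent = [[1, 0], [1, 0], [0, 1], [0, 1]]
reduction = variance_reduction(parent, parent[:2], parent[2:])
```

A test that yields homogeneous children leaves no intra-cluster variance, so the reduction equals the parent variance. A tree learner of this kind would pick, in each node, the test with the largest reduction.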

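Instantiating d

- Class labels are represented in a vector with one component per class, in the order (1) to (5) of the toy hierarchy, e.g. $v_i = [1,1,0,1,0]$
- Toy hierarchy: 1 (1), 2 (2), 2/1 (3), 2/2 (4), 3 (5)
- Distance between vectors is defined as a weighted Euclidean distance, with a weight per component that depends on the depth of the class in the hierarchy:

$$d(x_1, x_2) = \sqrt{\sum_k w_k \,(v_{1,k} - v_{2,k})^2}, \qquad w_k = w^{\mathrm{depth}(c_k)}$$

- Example: $S_i = \{1, 2, 2/2\}$, $S_j = \{2\}$. The vectors differ in class 1 (depth 1) and class 2/2 (depth 2), so

$$d_{\mathrm{Eucl}}([1,1,0,1,0],[0,1,0,0,0]) = \sqrt{w + w^2}$$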
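The example can be checked numerically with a short Python sketch. The class order follows the toy hierarchy; the value `w = 0.75` is an arbitrary assumption chosen for illustration (the slide does not give a value), and the helper names are hypothetical.

```python
import math
from typing import Iterable, List

# Classes of the toy hierarchy, in vector positions (1) to (5).
CLASSES = ["1", "2", "2/1", "2/2", "3"]


def depth(class_label: str) -> int:
    """Depth in the hierarchy: '2' has depth 1, '2/2' has depth 2."""
    return class_label.count("/") + 1


def encode(label_set: Iterable[str]) -> List[int]:
    """Turn a set of class labels into a 0/1 vector over CLASSES."""
    labels = set(label_set)
    return [1 if c in labels else 0 for c in CLASSES]


def weighted_distance(v1: List[int], v2: List[int], w: float = 0.75) -> float:
    """Weighted Euclidean distance with w_k = w ** depth(c_k); w is an assumed value."""
    return math.sqrt(
        sum(w ** depth(c) * (a - b) ** 2 for c, a, b in zip(CLASSES, v1, v2))
    )


s_i = encode({"1", "2", "2/2"})   # [1, 1, 0, 1, 0]
s_j = encode({"2"})               # [0, 1, 0, 0, 0]
d = weighted_distance(s_i, s_j)   # equals sqrt(w + w**2) for w = 0.75
```

With `w < 1`, disagreements deeper in the hierarchy contribute less to the distance than disagreements near the root.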

Instantiating p

- Each leaf contains multiple classes (organised in a hierarchy). Which classes to predict?
  - binary classification: predict positive if the instance ends up in a leaf with at least 50% positives
  - multilabel classification: skewed class distributions make a fixed 50% cut-off unsuitable
- Threshold
  - an instance ending up in some leaf is predicted to belong to class $c_i$ if $v_i \ge t_i$, with $v_i$ the proportion of instances in the leaf belonging to $c_i$, and $t_i$ some threshold
  - by varying the threshold, we obtain different points on the precision-recall curve
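A minimal Python sketch of this thresholding rule follows. The leaf contents are made up for illustration, a single threshold is applied to all classes for simplicity, and the function names are hypothetical.

```python
from typing import List, Sequence


def leaf_proportions(leaf_vectors: List[Sequence[int]]) -> List[float]:
    """Proportion v_i of training instances in the leaf that belong to each class."""
    if not leaf_vectors:
        raise ValueError("a leaf must contain at least one training instance")
    n = len(leaf_vectors)
    return [sum(column) / n for column in zip(*leaf_vectors)]


def predict(proportions: Sequence[float], thresholds: Sequence[float]) -> List[bool]:
    """Predict class c_i exactly when v_i >= t_i."""
    return [v >= t for v, t in zip(proportions, thresholds)]


# Hypothetical leaf with four training instances over the five toy classes.
leaf = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
v = leaf_proportions(leaf)        # [0.5, 1.0, 0.25, 0.25, 0.0]
strict = predict(v, [0.5] * 5)    # only classes 1 and 2 are predicted
lenient = predict(v, [0.2] * 5)   # classes 2/1 and 2/2 are predicted as well
```

Lowering the threshold predicts more classes per instance, which tends to raise recall at the cost of precision.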

Clus-HMC algorithm

[Figure: pseudo code of the algorithm, with the stopping criterion marked]

Experiments in yeast functional genomics

- Saccharomyces cerevisiae, or baker’s/brewer’s yeast
- MIPS FunCat hierarchy: function of yeast genes
  - 1 METABOLISM
    - 1/1 amino acid metabolism
    - 1/2 nitrogen and sulfur metabolisms
    - …
  - 2 ENERGY
    - 2/1 glycolysis and gluconeogenesis
    - …
- 12 data sets [Clare 2003]
  - Sequence structure (seq)
  - Phenotype growth (pheno)
  - Secondary structure (struc)
  - Homology search (hom)
  - Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)

Experimental evaluation

- Objectives
  - Comparison with C4.5H [Clare 2003]
  - Evaluation of the improvement obtainable with HMC trees over single classification trees
- Evaluation with precision-recall curves
  - precision $= TP / \text{Yes} = TP / (TP + FP)$, where Yes is the number of instances predicted positive
  - recall $= TP / {+} = TP / (TP + FN)$, where + is the number of actual positives
  - advantages
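The two measures, and the way a threshold sweep produces points on a precision-recall curve, can be sketched in Python as follows. The scores and labels are made up; returning precision 1.0 when nothing is predicted positive is a convention assumed here, not something stated on the slides.

```python
from typing import List, Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall for one class when predicting positive iff score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention for no predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_points(
    scores: Sequence[float], labels: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """One (precision, recall) point per threshold."""
    return [precision_recall(scores, labels, t) for t in thresholds]


# Hypothetical leaf proportions for one class and the true labels of four instances.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
points = pr_points(scores, labels, [0.5, 0.3])
```

Here the threshold 0.5 gives precision 0.5 and recall 0.5, while lowering it to 0.3 recovers the second positive and gives precision 2/3 and recall 1.0.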

Comparison with C4.5H

- C4.5H = hierarchical multilabel extension of C4.5 [Clare 2003]
  - Designed by Amanda Clare
  - Heuristic: information gain, using an adaptation of entropy (sum over all classes)
  - Prediction: most frequent set of classes + significance test
- Clus-HMC method
  - Tuning: try different F-tests on validation data and choose the F-test with the highest AUPRC

Comparison between Clus-HMC and C4.5H: average case

Comparison between Clus-HMC and C4.5H: specific classes

- 25 wins (II), 6 losses (IV)

[Figure: plot divided into regions I, II, III and IV]

Comparing rules

E.g. predictions for class 40/3 in the “gasch1” data set.

C4.5H: two rules

- `IF 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___15_minutes <= 0.03 AND constant_0point32_mM_H202_20_min_redo <= 0.72 AND 1point5_mM_diamide_60_min <= -0.17 AND steady_state_1M_sorbitol > -0.37 AND DBYmsn2_4__37degree_heat___20_min <= -0.67 THEN 40/3` (Precision: 0.52, Recall: 0.26)
- `IF Heat_Shock_10_minutes_hs_1 <= 1.82 AND Heat_Shock_030inutes__hs_2 <= -0.48 AND 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___5_minutes > -0.1 THEN 40/3` (Precision: 0.56, Recall: 0.18)

Clus-HMC (most precise rule)

- `IF Nitrogen_Depletion_8_h <= -2.74 AND Nitrogen_Depletion_2_h > -1.94 AND 1point5_mM_diamide_5_min > -0.03 AND 1M_sorbitol___45_min_ > -0.36 AND 37C_to_25C_shock___60_min > 1.28 THEN 40/3` (Precision: 0.97, Recall: 0.15)

HMC vs. single classification

- Method
- Average case

HMC vs. single classification: specific classes

- Numbers are AUPRC(Clus-HMC) - AUPRC(Clus-SC)
- HMC performs better!

Conclusions

- Use of precision-recall curves to optimize the learned models and to evaluate the results
- Improvement over C4.5H
- HMC compared to SC
  - Comparable predictive performance
  - Faster
  - Easier to interpret

References

- Hendrik Blockeel, Luc De Raedt, Jan Ramon. Top-down induction of clustering trees (1998)
- Amanda Clare. Machine learning and data mining for yeast functional genomics. Doctoral dissertation (2003)
- Jan Struyf, Sašo Džeroski, Hendrik Blockeel, Amanda Clare. Hierarchical multi-classification with predictive clustering trees in functional genomics (2005)

Questions?
