Topic 12 An Introduction to Tree Models


Topic 12 An Introduction to Tree Models Xiaogang Su, Ph.D. Department of Mathematical Sciences University of Texas at El Paso (UTEP) xsu@utep.edu

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- The current standard of constructing trees – CART methodology
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

1. What is a Tree Model? A tree-based method (or recursive partitioning) recursively partitions the predictor space to model the relationship between two sets of variables (responses vs. predictors). Tree modeling amounts to piecewise-constant fitting. Two types in data mining:
- Regression trees (continuous response)
- Classification/decision trees (categorical response)

A Simple Regression Example
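Since the figure itself is not reproduced here, a minimal R sketch of such an example (simulated data; rpart is only one possible package choice) may help: it fits a single-predictor regression tree and overlays the piecewise-constant tree fit and the least-squares line referred to on the next slide.

library(rpart)

set.seed(1)
x <- runif(200, 0, 10)
y <- ifelse(x < 4, 2, ifelse(x < 7, 6, 3)) + rnorm(200, sd = 0.8)
dat <- data.frame(x = x, y = y)

fit  <- rpart(y ~ x, data = dat, method = "anova")          # regression tree
grid <- data.frame(x = seq(0, 10, by = 0.05))
tree_fit <- predict(fit, newdata = grid)                    # step (piecewise-constant) function
ls_fit   <- predict(lm(y ~ x, data = dat), newdata = grid)  # straight line

plot(dat$x, dat$y, pch = 16, col = "grey", xlab = "x", ylab = "y")
lines(grid$x, tree_fit, lwd = 2)          # regression tree fit
lines(grid$x, ls_fit, lty = 2, lwd = 2)   # LS fitted line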

An Illustration (figure): the piecewise-constant fit of a regression tree versus the LS fitted line.

Some Tree Terminology (figure, with nodes labeled I–VI): the root node sits at the top; each internal node applies a splitting rule with YES/NO branches; nodes that are not split further are terminal nodes (leaves).

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- The current standard of constructing trees - CART
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

A Brief History
- Morgan and Sonquist (1963, JASA): Automatic Interaction Detection (AID)
- Hartigan (1975) and Kass (1980): CHAID
- Breiman, Friedman, Olshen and Stone (1984): Classification And Regression Trees (CART)
- Another similar implementation: Quinlan (1979), ID3 & C4.5
- Extensions: Multivariate Adaptive Regression Splines (MARS), Friedman (1991); Hierarchical Mixtures of Experts, Jordan and Jacobs (1994)
- Bagging (Breiman, 1996); Boosting (Freund and Schapire, 1996; Friedman, 2001)
- Random Forests: Breiman (2001)

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- The current standard of constructing trees - CART
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

Why Decision Trees? IF Petal-length > 2.6 AND Petal-width ≤ 1.65 AND Petal-length > 5 AND Sepal-length > 6.05 THEN the flower is Iris virginica. This is not the only rule (path) for this species. What is the other path? (Figure adapted from [SGI2001].)

Interpretability Example: When a bank rejects a credit card application, it is better to explain to the customer that the decision was made because
- he/she is not a permanent resident of Australia, AND
- he/she has been residing in Australia for < 6 months, AND
- he/she does not have a permanent job.
This is better than saying: "We are very sorry, but our neural network thinks that you are not a credit-worthy customer." (In which case the customer might become angry and move to another bank.) Interpretability is one of the main advantages of decision trees. Other pros and cons will be discussed later.

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- CART methodology of constructing trees
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

Building a Tree Model
- How to split the data (grow the tree)? Minimize within-node impurity, or equivalently maximize the between-node difference.
- How to declare a terminal node and determine the optimal tree size? Pre-pruning (a stopping rule), or pruning: backward deletion plus cross-validation.
- How to summarize each terminal node? The node average (regression) or majority voting (classification).

Splitting A split is preferable if the observations within the resulting child nodes become more homogeneous (less impure) in terms of the response values or classes. Impurity measures: misclassification rate, entropy, Gini index. Equivalently, after the split the observations in the two resulting child nodes become more heterogeneous between nodes, so any appropriate two-sample test statistic can be used as a measure of the goodness of split. Maximize the measure over all permissible splits to identify the best one. (Figure: a YES/NO split of a lung-cancer node with a 50% : 50% class mix into left and right child nodes with 90% : 10% and 10% : 90% mixes.)

Splitting Criterion – Regression Trees
Minimizing Within-Node Impurity
- Impurity measure – SSE: for a node t, SSE(t) = Σ_{i ∈ t} (y_i − ȳ_t)².
- The best split minimizes the within-node impurity of the two child nodes, SSE(t_L) + SSE(t_R).
Maximizing Between-Node Difference
- A two-sample t (or F) test measures the between-node difference in mean response.
- The best split maximizes |t| (or, equivalently, the F statistic).
- The split can also be cast in a linear model y = β0 + β1·1{X ≤ c} + ε, testing H0: β1 = 0.
These two approaches are equivalent and provide exactly the same best split.
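The following is an illustrative sketch (my own helper function, not code from the slides) of the greedy search just described: for each candidate cutoff of a numeric predictor it computes the within-node SSE of the two child nodes and the pooled two-sample t statistic, so one can check that the cutoff minimizing the former is the one maximizing |t|.

# Exhaustive (greedy) search for the best split of a numeric response y
# on a numeric predictor x.
best_split <- function(x, y, min_n = 5) {
  cuts <- sort(unique(x))
  cuts <- (cuts[-1] + cuts[-length(cuts)]) / 2                 # midpoints between observed values
  cuts <- cuts[sapply(cuts, function(c) min(sum(x <= c), sum(x > c)) >= min_n)]
  sse  <- function(v) sum((v - mean(v))^2)
  res  <- t(sapply(cuts, function(c) {
    left  <- y[x <= c]
    right <- y[x >  c]
    c(cutoff     = c,
      within_sse = sse(left) + sse(right),                     # within-node impurity after the split
      abs_t      = abs(unname(t.test(left, right, var.equal = TRUE)$statistic)))
  }))
  res[order(res[, "within_sse"]), , drop = FALSE][1, ]         # best split: smallest within-node SSE
}

# e.g. best_split(dat$x, dat$y) on the simulated data above returns the cutoff
# with the smallest within-node SSE, which is also the one with the largest |t|.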

An Illustration of a Single Split (table): the original data consist of 10 subjects with columns id, QoL (the response), Age, and Race (B, W, or O); e.g., subject 1 has QoL = 16.7, Age = 17, Race = B. The candidate splits induced by AGE are AGE ≤ 17, 23, 32, 40, 45, 67, 76, and those induced by RACE are Race ∈ {B}, Race ∈ {W}, Race ∈ {O}. The two-sample t test statistics for these ten candidate splits are 1.1248, 1.9175, 2.7894, -1.0446, 0.0621, 0.8111, 0.0688, 0.1521, -0.2819, 0.0946. The split with the largest absolute t value (2.7894, for AGE ≤ 32) is the best split.

Splitting Criterion – Decision Trees
Impurity Measures (for a node with class proportions p_1, …, p_K):
- Misclassification Rate: 1 − max_k p_k
- Entropy: −Σ_k p_k log p_k
- Gini Index: Σ_k p_k (1 − p_k) = 1 − Σ_k p_k²
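As a concrete reference, here is a small sketch of the three measures as R functions of a node's class-label vector (the function names are mine), together with the corresponding impurity-reduction (goodness-of-split) computation:

# Impurity measures for a node, given its vector of class labels.
misclass <- function(labels) {
  p <- prop.table(table(labels))
  1 - max(p)
}
entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]
  -sum(p * log2(p))          # base-2 logs; natural logs work equally well
}
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Impurity reduction (goodness) of splitting a node into left/right children:
impurity_reduction <- function(left, right, impurity = gini) {
  n <- length(left) + length(right)
  impurity(c(left, right)) -
    (length(left)  / n) * impurity(left) -
    (length(right) / n) * impurity(right)
}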

Splitting Criterion – Decision Trees
Maximizing the between-node difference by applying an appropriate two-sample test statistic (Pearson chi-squared or likelihood ratio test).
The split can also be cast in a logistic regression model, logit Pr(Y = 1) = β0 + β1·1{X ≤ c}, testing H0: β1 = 0.
Equivalence between the Two Approaches:
- If the likelihood ratio test statistic is used, the criterion obtained is equivalent to the impurity-reduction criterion based on entropy.
- If a linear model is fit with least squares (as in regression trees), the criterion obtained is equivalent to the impurity-reduction criterion based on the Gini index.

Why binary splits? A multi-way split can always be obtained by applying several binary splits in succession. It is hard to achieve optimality with multi-way splits because the number of allowable splits increases quickly and it is difficult to compare splits with different numbers of branches. (Figure: a binary split vs. a multi-way split.)

Several Other Issues
- Splitting on a categorical variable: 'ordinal'-ize the variable first when it has too many levels (a rough sketch follows).
- Select the splitting variable first and then apply a greedy search to identify the best cutoff point.
- Split with random sampling.
- Keep splitting until a large initial tree is obtained.
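A rough sketch of the 'ordinal'-izing trick for a many-level categorical predictor in a regression tree (the data and variable names below are hypothetical): order the levels by their mean response, then search cutoffs along that ordering as if the variable were ordinal.

# Reduce the 2^(L-1) - 1 subset splits of an L-level factor to L - 1
# ordered splits by ranking the levels by their mean response.
ordinalize <- function(x, y) {
  level_means    <- tapply(y, x, mean)               # mean response per level
  ordered_levels <- names(sort(level_means))
  as.integer(factor(x, levels = ordered_levels))     # rank of each observation's level
}

# The numeric ranks can now be split with the same greedy cutoff search used
# for ordered predictors, e.g. best_split(ordinalize(race, qol), qol) with the
# helper function sketched earlier.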

Pre-Pruning and Post-Pruning
- Pre-pruning (stopping rules) is problematic: it risks underfitting or overfitting.
- Post-pruning (CART): first grow a large initial tree; one of its subtrees will be selected as the final tree structure. To narrow down the choice of subtrees, a pruning algorithm is applied that iteratively truncates the initial large tree into a sequence of nested subtrees. A test sample or v-fold cross-validation is then used to select the best subtree, i.e., to determine the best tree size.

Pruning Any subtree of the initial tree could be the final tree model. A tree T1 is called a subtree of a tree T if every node of T1 is also a node of T. However, the number of subtrees increases rapidly with the size of the initial tree: for full trees of depth 1, 2, 3, …, the number of subtrees is 1, 2, 5, 26, 677, 458330, … Breiman et al. (1984) proposed an efficient pruning algorithm to narrow down the number of candidate subtrees.
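For reference, the Breiman et al. (1984) algorithm is cost-complexity (weakest-link) pruning; in LaTeX notation its criterion is

R_\alpha(T) \;=\; R(T) \;+\; \alpha\,|\widetilde{T}|,

where \(R(T)\) is the resubstitution error of subtree \(T\) (within-node SSE for regression, misclassification cost for classification), \(|\widetilde{T}|\) is the number of terminal nodes of \(T\), and \(\alpha \ge 0\) is the complexity parameter. For each \(\alpha\) there is a smallest subtree minimizing \(R_\alpha\), and as \(\alpha\) increases these optimal subtrees form the nested sequence referred to above.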

A Diagram for Pruning (figure): (a) the initial tree, (b) the branch rooted at an internal node t, and (c) the pruned subtree obtained by cutting off that branch at t.

Tree Size Determination The best tree is selected, via some validation method, from the subtree sequence obtained from the pruning algorithm.
- Validation methods: a test sample, cross-validation, or bootstrap resampling techniques.
- Selection criteria: GCV, AIC, BIC, etc.
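A sketch of this step using the tree package (the data frame `train`, the held-out `test` set, and the response `y` are hypothetical; cv.tree() performs the v-fold cross-validation over the pruning sequence):

library(tree)

# Grow a large initial tree (the control argument of tree() can be used to
# loosen the default stopping rules), then examine the pruning sequence.
big <- tree(y ~ ., data = train)

cv <- cv.tree(big)                        # cross-validated deviance for each subtree size
best_size <- cv$size[which.min(cv$dev)]   # size with the smallest CV deviance
final <- prune.tree(big, best = best_size)

# With a separate test sample instead of cross-validation, the deviance of
# each pruned subtree can be evaluated directly on the test data, e.g.:
# test_dev <- sapply(cv$size, function(k)
#   sum((test$y - predict(prune.tree(big, best = k), newdata = test))^2))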

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- CART methodology of constructing trees
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

Implementation in SAS and R
R packages available:
- tree: tree(), prune.tree(), cv.tree(), predict.tree()
- rpart: rpart(), prune.rpart(), post.rpart()
- party (similar to CHAID): ctree()
- C50: C5.0()
SAS implementation: SAS Enterprise Miner; PROC SPLIT
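For instance, a minimal rpart workflow might look as follows (a sketch only; the data frame `train` and the binary response `hosp` are hypothetical stand-ins, not the hospitalization data analyzed below):

library(rpart)

# 1. Grow a large initial tree with liberal stopping rules.
big <- rpart(hosp ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 20, xval = 10))

# 2. Inspect the cost-complexity table (cross-validated error for each subtree).
printcp(big)

# 3. Prune back to the subtree with the smallest cross-validated error.
best_cp   <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
best_tree <- prune(big, cp = best_cp)

# 4. Predict on new data.
# predict(best_tree, newdata = test, type = "prob")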

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- The current standard of constructing trees – CART methodology
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example

The Hospitalization Data Obtained from a healthcare consulting company. N = 24,935 and p = 83 predictors. The goal is to predict whether an insured has a hospitalization in the following year based on his/her demographics and previous-year information.

R: Initial Large Tree (figure). Tree size (number of terminal nodes) = 207.

Tree Size Selection via Validation Sample (figure). Best tree size = 12.

The Best Tree (Size = 12): The Final Tree Structure
node), split, n, deviance, yval, (yprob)
      * denotes terminal node
 1) root 14955 6803.00 0 ( 0.93982 0.06018 )
   2) risk_p_cost < 0.681696 10037 2821.00 0 ( 0.96832 0.03168 )
     4) risk_c_cost < 0.125227 6327 1329.00 0 ( 0.97819 0.02181 )
       8) age < 38.5 2602 300.40 0 ( 0.98962 0.01038 ) *
       9) age > 38.5 3725 998.60 0 ( 0.97020 0.02980 ) *
     5) risk_c_cost > 0.125227 3710 1440.00 0 ( 0.95148 0.04852 )
      10) allow_current_total < 1824.49 2672 872.70 0 ( 0.96145 0.03855 ) *
      11) allow_current_total > 1824.49 1038 548.70 0 ( 0.92582 0.07418 ) *
   3) risk_p_cost > 0.681696 4918 3576.00 0 ( 0.88166 0.11834 )
     6) Preg_Incomplete < 0.5 4783 3264.00 0 ( 0.89254 0.10746 )
      12) risk_p_cost < 4.48499 4238 2497.00 0 ( 0.91340 0.08660 )
        24) risk_p_cost < 2.34497 3069 1578.00 0 ( 0.92864 0.07136 )
          48) NC_OVR14 < 0.5 2959 1438.00 0 ( 0.93410 0.06590 )
            96) allow_current_prof < 2686.78 2260 949.60 0 ( 0.94602 0.05398 ) *
            97) allow_current_prof > 2686.78 699 467.90 0 ( 0.89557 0.10443 ) *
          49) NC_OVR14 > 0.5 110 115.40 0 ( 0.78182 0.21818 ) *
        25) risk_p_cost > 2.34497 1169 888.20 0 ( 0.87340 0.12660 ) *
      13) risk_p_cost > 4.48499 545 635.50 0 ( 0.73028 0.26972 )
        26) allow_current_med < 4205.6 147 92.46 0 ( 0.90476 0.09524 ) *
        27) allow_current_med > 4205.6 398 507.10 0 ( 0.66583 0.33417 )
          54) MD_visitcnt200703 < 35.5 349 425.20 0 ( 0.70201 0.29799 ) *
          55) MD_visitcnt200703 > 35.5 49 66.27 1 ( 0.40816 0.59184 ) *
     7) Preg_Incomplete > 0.5 135 187.10 1 ( 0.49630 0.50370 ) *

Variables actually used in tree construction:
"risk_p_cost" "risk_c_cost" "age" "allow_current_total" "Preg_Incomplete" "NC_OVR14" "allow_current_prof" "allow_current_med" "MD_visitcnt200703"

Best Tree

ROC (Test Sample)
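One way such a test-sample ROC curve could be produced in R (a sketch; `best_tree`, `test`, and the response `hosp` are the hypothetical objects from the rpart workflow sketched earlier, and pROC is only one of several packages that compute ROC curves):

library(pROC)

prob    <- predict(best_tree, newdata = test, type = "prob")[, 2]  # predicted P(hospitalization)
roc_obj <- roc(test$hosp, prob)    # ROC on the held-out test sample
plot(roc_obj)
auc(roc_obj)                       # area under the curve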

Outline
- What is a tree-structured model?
- History of tree modeling
- Why tree-based methods?
- CART methodology of constructing trees
  - Growing a large initial tree
  - Pruning
  - Tree size selection via validation
- Implementation
- An Example
- Discussion: Pros and Cons

Pros
- It is nonparametric: it requires few statistical assumptions and hence is robust.
- It ameliorates the "curse of dimensionality" (a term due to Richard Bellman) and can be applied to various data structures involving both ordered and categorical variables in a simple and natural way. In particular, recursive partitioning is exceptionally efficient in handling categorical predictors.

Pros (Continued)
- It does variable selection, complexity reduction, and (implicit) interaction handling automatically.
- It is invariant under all monotone transformations of individual ordered predictors.
- The output gives easily understood and interpreted information.
- Special features are available for handling missing values and for ranking variables by importance.

Cons
- Trees sometimes do not perform well for estimation or prediction tasks.
- Tree models are unstable. These two weaknesses can be addressed by MARS, bagging, boosting, and random forests (perturb-and-ensemble procedures).
- The best split is only locally optimal at each node, so the tree structure may not be globally optimal.
- Biased variable selection: e.g., a categorical variable with many levels is more likely to be selected than a binary variable even when neither is related to the response.