CART: Classification and Regression Tree
Motivation: development of a reliable clinical decision rule that can be used to classify new patients into clinically important categories or risk categories, so that appropriate decisions can be made regarding patient management.
Example 1: Molecular abnormalities in the major psychiatric illnesses: Classification and Regression Tree (CRT) analysis of post-mortem prefrontal markers. M. B. Knable et al., Molecular Psychiatry, 2002, 7(4).
Post-mortem specimens from the Stanley Foundation Neuropathology Consortium, which contains matched samples from patients with schizophrenia, bipolar disorder, non-psychotic depression, and normal controls (n = 15 per group), have been distributed to many research groups around the world. This paper provides a summary of abnormal markers found in prefrontal cortical areas from this collection between 1997 and the time of publication. Using parametric analyses of variance of 102 separate data sets, 14 markers were found to be abnormal in at least one disease.
The markers pertained to a variety of neural systems and processes including neuronal plasticity, neurotransmission, signal transduction, inhibitory interneuron function and glial cells. The data sets were also examined using the non-parametric Classification and Regression Tree (CRT) technique for the four diagnostic groups and in pair-wise combinations. In contrast to the results obtained with analyses of variance, the CRT method identified a smaller set of nine markers that contributed maximally to the diagnostic classifications.
Three of the nine markers observed with CRT overlapped with the ANOVA results. Six of the nine markers observed with the CRT technique pertained to aspects of glutamatergic, GABA-ergic, and dopaminergic neurotransmission.
Example 2: Sperm morphology, motility, and concentration in fertile and infertile men. Guzick DS, Overstreet JW, Factor-Litvak P, et al., New England Journal of Medicine, 2001, 345(19).
Background: Although semen analysis is routinely used to evaluate the male partner in infertile couples, sperm measurements that discriminate between fertile and infertile men are not well defined.
Methods: We evaluated two semen specimens from each of the male partners in 765 infertile couples and 696 fertile couples at nine sites. The female partners in the infertile couples had normal results on fertility evaluation. The sperm concentration and motility were determined at the sites; semen smears were stained at the sites and shipped to a central laboratory for an assessment of morphologic features of sperm with the use of strict criteria.
We used classification-and-regression-tree analysis to estimate threshold values for subfertility and fertility with respect to the sperm concentration, motility, and morphology. We also used an analysis of receiver-operating-characteristic curves to assess the relative value of these sperm measurements in discriminating between fertile and infertile men.
Results: The subfertile ranges were a sperm concentration of less than 13.5×10^6 per milliliter, less than 32 percent of sperm with motility, and less than 9 percent with normal morphologic features. The fertile ranges were a concentration of more than 48.0×10^6 per milliliter, greater than 63 percent motility, and greater than 12 percent normal morphologic features. Values between these ranges indicated indeterminate fertility.
There was extensive overlap between the fertile and the infertile men within both the subfertile and the fertile ranges for all three measurements. Although each of the sperm measurements helped to distinguish between fertile and infertile men, none was a powerful discriminator. The percentage of sperm with normal morphologic features had the greatest discriminatory power.
Conclusions: Threshold values for sperm concentration, motility, and morphology can be used to classify men as subfertile, of indeterminate fertility, or fertile. None of the measures, however, is diagnostic of infertility.
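Taken together, the published thresholds define a simple per-measurement, three-way decision rule. The short Python sketch below illustrates that rule; the function name is made up for illustration, and the way the paper combines the three measurements into an overall classification is more nuanced than a single function.

```python
# Per-measurement three-way rule implied by the published thresholds.
# Illustrative only; the study's overall classification is more nuanced.

def classify_measurement(value, subfertile_below, fertile_above):
    """Classify one semen measurement as subfertile, indeterminate, or fertile."""
    if value < subfertile_below:
        return "subfertile"
    if value > fertile_above:
        return "fertile"
    return "indeterminate"

# Thresholds from the abstract (concentration in millions/mL, others in %).
print(classify_measurement(10.0, 13.5, 48.0))  # concentration -> subfertile
print(classify_measurement(70.0, 32.0, 63.0))  # motility      -> fertile
print(classify_measurement(10.0, 9.0, 12.0))   # morphology    -> indeterminate
```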
Components of a classification problem:
1. Outcome or "dependent" variable. This can be continuous, categorical (ordinal or nominal), or a time-to-event variable, e.g.:
a) Example 1: diagnostic group (schizophrenia, bipolar disorder, non-psychotic depression, or normal control)
b) Example 2: fertility status (fertile vs. infertile)
c) Other examples: blood pressure, patient survival, need for surgery, presence of myocardial infarction, medication compliance
2. Predictor or independent variables, e.g.:
a) Example 1: the post-mortem prefrontal cortical markers
b) Example 2: sperm concentration, motility, and morphology
3. Learning data set: a dataset that includes values for both the outcome and predictor variables, from a group of patients similar to those for whom we would like to be able to predict outcomes in the future.
4. Test data set: consists of patients for whom we would like to be able to make accurate predictions. This test dataset may or may not exist in practice; a separate test dataset is not always required to determine the performance of a decision rule (cross-validation, discussed later, is one alternative). A typical random split is sketched below.
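A minimal sketch of carving a test set out of the available data, assuming scikit-learn is available; the iris dataset stands in for a real clinical learning dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X holds predictor values and y the outcomes.
X, y = load_iris(return_X_y=True)

# Reserve 25% of cases as a test set for evaluating the decision rule.
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_learn), "learning cases,", len(X_test), "test cases")
```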
A decision problem can include two other factors: a "prior" probability for each outcome, which represents the probability that a randomly selected future patient will have a particular outcome; and a decision cost or loss function, which represents the inherent cost associated with an error in prediction.
For example, it is a much more serious error to classify a patient with an emergent medical condition as non-urgent, than to misclassify a patient with a non-urgent medical condition as urgent.
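To make this asymmetry concrete, here is a small numeric sketch of a prior distribution and a cost matrix for the triage example; all numbers are hypothetical weights chosen for illustration, not values from any source.

```python
import numpy as np

# cost[true_class][predicted_class]; classes: 0 = emergent, 1 = non-urgent.
# Hypothetical weights: calling an emergent patient "non-urgent" is 50x
# worse than the reverse error.
cost = np.array([[0.0, 50.0],
                 [1.0,  0.0]])

# Hypothetical prior: 10% of future patients are truly emergent.
prior = np.array([0.1, 0.9])

# Expected cost of each constant prediction rule:
for pred, name in [(0, "emergent"), (1, "non-urgent")]:
    print(f"always predict {name}: expected cost = {prior @ cost[:, pred]:.2f}")
# 'Always non-urgent' costs 5.00 vs. 0.90, even though emergencies are rare.
```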
Features of CART: Nodes: parent node, child node, root node, terminal node. Binary splits: a "node" in a decision tree can only be split into two groups.
Each split is based on only one variable. Recursive partitioning: the binary partitioning process can be applied over and over again; each parent node can give rise to two child nodes and, in turn, each of these child nodes may itself be split, forming additional children. A minimal sketch of the resulting node structure follows.
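The following minimal Python node structure (a sketch; the names are illustrative and not tied to any particular CART implementation) shows how parent, child, root, and terminal nodes relate under binary splits.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the single splitting variable
    threshold: Optional[float] = None  # go left if x[feature] <= threshold
    left: Optional["Node"] = None      # first child node
    right: Optional["Node"] = None     # second child node
    prediction: Optional[str] = None   # class label, used at terminal nodes

    @property
    def is_terminal(self) -> bool:
        # A terminal node has no children; the root is simply the top-most
        # parent node from which recursive partitioning begins.
        return self.left is None and self.right is None
```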
To construct a CART: Tree building: a tree is built using recursive splitting of nodes, continuing until a "maximal" tree is produced according to some stopping rule; this maximal tree probably greatly overfits the information contained within the learning dataset. Tree pruning: the maximal tree is cut back to generate a sequence of simpler and simpler subtrees. Optimal tree selection: the tree that fits the information in the learning dataset, but does not overfit it, is selected from among the sequence of pruned trees.
Tree building: Tree building begins at the root node, which includes all patients in the learning dataset. Beginning with this node, the CART software finds the best possible variable to split the node into two child nodes. In order to find the best variable, the software checks all possible splitting variables (called splitters), as well as all possible values of each variable that could be used to split the node. In choosing the best splitter, the program seeks to minimize the total impurity of the two child nodes. A sketch of this exhaustive search appears below.
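A plain Python/NumPy sketch of this exhaustive search, using the Gini index (defined just below) as the impurity measure; this is a re-implementation for illustration, not the actual CART software.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Check every variable and every observed value as a candidate splitter,
    keeping the split that minimizes the total impurity of the two children."""
    n, n_features = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, weighted impurity)
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:  # all candidate split points
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best
```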
Measures of the impurity of a node with class proportions p = (p_1, p_2, p_3, ..., p_k):
Information, or entropy: E = -Σ p_i log(p_i), with the convention 0 log 0 = 0
Gini index: G = 1 - Σ p_i^2
We define the impurity of a tree to be the sum, over all terminal nodes, of the impurity of the node multiplied by the proportion of cases that reach that node of the tree.
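A short sketch computing these impurity measures directly from class proportions, plus the weighted tree impurity over terminal nodes; purely illustrative code.

```python
import numpy as np

def entropy(p):
    """E = -sum(p_i log p_i), base-2 logs, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def gini(p):
    """G = 1 - sum(p_i^2)."""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def tree_impurity(leaf_fractions, leaf_class_probs, impurity=gini):
    """Sum over terminal nodes of node impurity times the fraction of cases
    that reach that node."""
    return sum(w * impurity(p) for w, p in zip(leaf_fractions, leaf_class_probs))

# A pure node has impurity 0; a 50/50 two-class node is maximally impure.
print(gini([1.0, 0.0]), gini([0.5, 0.5]))        # 0.0  0.5
print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))  # 0.0  1.0
```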
The predicted class assigned to each terminal node depends on three factors: (1) the assumed prior probability of each class within future datasets; (2) the decision loss or cost matrix; and (3) the fraction of subjects with each outcome in the learning dataset that end up in that node.
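A hedged sketch of how these three factors combine under the usual minimum-expected-cost rule: the node's class probabilities are formed from the priors and node fractions, and the class minimizing expected cost is assigned. All numbers are hypothetical.

```python
import numpy as np

prior = np.array([0.5, 0.5])           # assumed prior probability of each class
cost = np.array([[0.0, 10.0],          # cost[true][predicted], hypothetical
                 [1.0,  0.0]])
frac_in_node = np.array([0.05, 0.40])  # fraction of each class's learning
                                       # cases that end up in this node

# Probability of each true class within the node (normalized).
p_node = prior * frac_in_node
p_node /= p_node.sum()

# expected[j] = sum_i p_node[i] * cost[i][j]; assign the minimizing class.
expected = cost.T @ p_node
print("assigned class:", int(np.argmin(expected)))  # class 0, despite rarity,
                                                    # because missing it is costly
```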
Stop tree building: The tree-building process goes on until it is impossible to continue. The process is stopped when:
All observations within each child node have an identical distribution of predictor variables, making splitting impossible;
An external limit on the number of levels in the maximal tree has been set by the user; or
An external limit on the size of the terminal nodes in the maximal tree has been reached. (These user-set limits are sketched below.)
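These user-set limits correspond to familiar hyperparameters in modern implementations. A brief sketch using scikit-learn's DecisionTreeClassifier (a CART-style learner), with the iris data standing in for a learning dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=5,          # external limit on the number of levels
    min_samples_leaf=10,  # external limit on terminal-node size
    random_state=0,
).fit(X, y)
print("terminal nodes:", tree.get_n_leaves())
```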
Tree pruning: In order to generate a sequence of simpler and simpler trees, each of which is a candidate for the appropriately fit final tree, the method of reduced-error pruning can be used. At each step we remove the "weakest link", the split whose removal produces the smallest increase in misclassification. The misclassification error gradually increases during the pruning process. More general pruning is also possible: cost-complexity pruning, sketched below.
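A sketch of cost-complexity pruning with scikit-learn: cost_complexity_pruning_path returns the sequence of effective complexity parameters (alphas), each of which indexes one member of the nested sequence of candidate subtrees; the iris data is again a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the maximal tree, then recover the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  terminal nodes={pruned.get_n_leaves()}")
```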
Optimal tree selection: The maximal tree will always fit the learning dataset with higher accuracy than any other tree, because the maximal tree is constructed to optimize its performance on the learning dataset.
The goal in selecting the optimal tree, defined with respect to expected performance on an independent set of data, is to find the value of the complexity parameter for which the information in the learning dataset is fit but not overfit. In general, finding this value would require an independent set of data, but this requirement can be avoided using the technique of cross-validation, sketched below.
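A sketch of this selection step, assuming scikit-learn: each candidate value of the complexity parameter is scored by cross-validation, and the best-scoring value is kept, so no separate test dataset is required.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate complexity parameters from the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validated accuracy for the subtree selected by each alpha.
scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10
    ).mean()
    for a in path.ccp_alphas
]
best = path.ccp_alphas[int(np.argmax(scores))]
print(f"selected alpha={best:.4f}, CV accuracy={max(scores):.3f}")
```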
Consider the relationship between tree complexity, reflected by the number of terminal nodes, and the decision cost, for an independent test dataset and for the original learning dataset.
As the number of nodes increases, the decision cost decreases monotonically for the learning data. This corresponds to the fact that the maximal tree will always give the best fit to the learning dataset. In contrast, the expected cost for an independent dataset reaches a minimum, and then increases as the complexity increases. This reflects the fact that an overfitted and overly complex tree will not perform well on a new set of data.
Cross-validation: Cross-validation is a computationally intensive method for validating a model-building procedure that avoids the requirement for a new or independent validation dataset. In cross-validation, the learning dataset is randomly split into N sections. One of these subsets of data is reserved for use as an independent test dataset, while the other N-1 subsets are combined for use as the learning dataset in the model-building procedure.
The entire model-building procedure is repeated N times, with a different subset of the data reserved for use as the test dataset each time. Thus, N different models are produced, each of which can be tested against an independent subset of the data. The amazing fact on which cross-validation is based is that the average performance of these N models is an excellent estimate of the performance of the original model (produced using the entire learning dataset) on a future independent set of patients.
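A minimal sketch of N-fold cross-validation as just described, using scikit-learn's KFold for the random splitting and a decision tree as the model-building procedure; the iris data stands in for a learning dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
N = 10  # number of sections the learning dataset is split into

accuracies = []
for train_idx, test_idx in KFold(n_splits=N, shuffle=True, random_state=0).split(X):
    # Build a model on N-1 sections, test it on the held-out section.
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

# The average of the N test performances estimates future performance.
print(f"estimated future accuracy: {np.mean(accuracies):.3f}")
```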
Advantages: No parametric assumptions (in contrast to linear regression, logistic regression, or Cox's proportional-hazards model). CART can cope with any data type (continuous, binary, ordinal, nominal). The classification has a simple form that is easy to understand.
CART identifies "splitting" variables based on an exhaustive search of all possibilities. Since efficient algorithms are used, CART is able to search all possible variables as splitters, even in problems with many hundreds of possible predictors. It handles complex interactions well: for example, the value of one variable (e.g., age) may substantially affect the importance of another variable (e.g., weight). It is robust with respect to outliers, and it provides an estimate of the misclassification rate.
Disadvantages: CART does not use combinations of variables in each split. Tree structures may be unstable: a change in the sample may give a different tree. The tree is optimal at each split, but it may not be globally optimal.
References:
1. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Brooks/Cole Publishing, Monterey, 1984.
2. Lewis RJ. An Introduction to Classification and Regression Tree (CART) Analysis. Presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California.
3. Venables WN, Ripley BD. Modern Applied Statistics with S. Springer, 2002.