
1 Biostatistics Case Studies 2005 Peter D. Christenson Biostatistician http://gcrc.humc.edu/Biostat Session 5: Classification Trees: An Alternative to Logistic Regression

2 Case Study
Goal of paper: Classify subjects as IR or non-IR using subject characteristics other than a definitive IR measure such as the clamp: BMI, HOMA, LDL, triglycerides, waist circumference, DBP, family history of diabetes, and possibly others (SBP, HDL?; the entire list is not clear).

3 Major Conclusion Using All Predictors
[Figure: partition of the HOMA-BMI plane into IR and non-IR regions, with cutpoints at HOMA 3.60 and 4.65 and BMI 27.5 and 28.9.] (p. 336, 1st column.)

4 Overview of Method: Classification Trees
General concept (details later) based on groupings:
1. Form combinations of subgroups according to High or Low on each characteristic.
2. Find actual IR rates in each subgroup.
3. Combine subgroups that give similar IR rates.
4. Classify as IR if the IR rate is large enough.
Note:
1. No model or statistical assumptions.
2. And so no p-values.
3. Many options are involved in the grouping details.
4. Actually implemented hierarchically – next slide.

5 Figure 2: the classification tree from the paper, with terminal nodes labeled "Classify as IR" or "Classify as non-IR".

6 Alternative: Logistic Regression
1. Find the equation: Prob(IR) = function(w1*BMI + w2*HOMA + w3*LDL + ...), where the w's are weights (coefficients).
2. Classify as IR if Prob(IR) is large enough.
Note: Assumes a specific statistical model. Gives p-values (which depend on the model being correct). Needs to use Prob(IR), which is very model-dependent, unlike High/Low categorizations.

7 Trees or Logistic Regression?
Logistic: Not originally designed for classifying, but for finding Prob(IR). Requires specification of predictor interrelations, either known or found through data examination; thus, not as flexible. Dependent on the model being correct. Can prove whether predictors are associated with IR.
Trees: Designed for classifying. Interrelations are not pre-specified, but detected in the analysis. Does not prove associations "beyond reasonable doubt", as regression does.

8 Goals of this Session
1. Use simulated data to:
   - Classify via logistic regression using 1 predictor.
   - Classify via trees using 1 predictor.
   - Show that the results are identical.
   - Show how the results differ when 2 predictors are used.
2. List options that must be specified with trees.

9 IR and HOMA
Simulated data: N = 2138, with the IR rate increasing with HOMA as in the actual data in the paper. Overall IR rate = 700/2138 ≈ 33%.

10 IR and HOMA: Logistic Fit
The fitted logistic model predicts the probability of IR as:
Prob(IR) = e^u / (1 + e^u), where u = -4.83 + 0.933(HOMA).
[Figure: fitted Prob(IR) vs. HOMA; a logistic curve has this sigmoidal shape.]
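A minimal sketch of this one-predictor fit in Python (not from the paper): the `homa` and `ir` arrays below are hypothetical stand-ins for the simulated data, generated to roughly follow the model on this slide; statsmodels then returns the coefficient, its p-value, and the fitted probabilities.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the simulated data: HOMA values and 0/1 IR indicators
# generated from the logistic model quoted on the slide.
rng = np.random.default_rng(0)
homa = rng.uniform(1.0, 10.0, size=2138)
true_prob = 1.0 / (1.0 + np.exp(-(-4.83 + 0.933 * homa)))
ir = (rng.uniform(size=2138) < true_prob).astype(int)

# Fit logit(Prob(IR)) = b0 + b1*HOMA.
X = sm.add_constant(homa)            # column of 1's for the intercept, plus HOMA
fit = sm.Logit(ir, X).fit(disp=False)
print(fit.params)                    # estimated intercept and HOMA coefficient
print(fit.pvalues)                   # p-value for the HOMA coefficient

# Fitted probabilities Prob(IR) = e^u / (1 + e^u) for each subject.
prob_ir = fit.predict(X)
```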

11 Using the Logistic Model for Classification
The logistic model proves that the risk of IR increases as HOMA increases (the coefficient 0.933 is significant, p < 0.0001). How can we classify as IR or not based on HOMA? Use Prob(IR): we need a cutpoint c so that we classify as:
Prob(IR) > c → classify as IR
Prob(IR) ≤ c → classify as non-IR
Regression does not supply c. It is chosen to balance sensitivity and specificity.

12 IR and HOMA: Logistic with Arbitrary Cutpoint
If cutpoint c = 0.50 is chosen, the classification table is:

                  Assign IR    Assign non-IR
 Actual IR        N = 440      N = 260
 Actual non-IR    N = 99       N = 1339

Sensitivity = 440/(440+260) = 62.9%
Specificity = 1339/(1339+99) = 93.1%

13 IR and HOMA: Logistic with Other Cutpoints
From SAS: Classification Table

 Prob      Correct            Incorrect          Percentages
 Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False POS  False NEG
 0.100    700       0        1438      0         32.7      100.0          0.0        67.3        .
 0.200    567     1181        257    133         81.8       81.0         82.1        31.2       10.1
 0.300    521     1277        161    179         84.1       74.4         88.8        23.6       12.3
 0.400    485     1325        113    215         84.7       69.3         92.1        18.9       14.0
 0.500    440     1339         99    260         83.2       62.9         93.1        18.4       16.3
 0.600    386     1354         84    314         81.4       55.1         94.2        17.9       18.8
 0.700    331     1363         75    369         79.2       47.3         94.8        18.5       21.3
 0.800    272     1376         62    428         77.1       38.9         95.7        18.6       23.7
 0.900    171     1404         34    529         73.7       24.4         97.6        16.6       27.4

Often, equal weight is given to misclassifying IR and non-IR in order to choose the "optimal" cutpoint. Here, that gives cutpoint = 0.37, with % correct = 85.2%, sensitivity = 71.7% and specificity = 91.7%.
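A sketch of how a table like this can be tabulated directly, assuming the `prob_ir` and `ir` arrays from the earlier logistic-fit sketch; for each cutpoint it counts correct and incorrect assignments and reports % correct, sensitivity, and specificity.

```python
import numpy as np

def classification_table(prob_ir, ir, cutpoints):
    """Print % correct, sensitivity and specificity at each probability cutpoint."""
    prob_ir, ir = np.asarray(prob_ir), np.asarray(ir)
    for c in cutpoints:
        assign_ir = prob_ir > c
        correct_ir = np.sum(assign_ir & (ir == 1))      # IR assigned IR
        correct_non = np.sum(~assign_ir & (ir == 0))    # non-IR assigned non-IR
        sens = correct_ir / np.sum(ir == 1)
        spec = correct_non / np.sum(ir == 0)
        pct_correct = (correct_ir + correct_non) / len(ir)
        print(f"{c:5.2f}  correct {pct_correct:6.1%}  sens {sens:6.1%}  spec {spec:6.1%}")

# Scan the same cutpoints as in the SAS table: 0.10, 0.20, ..., 0.90.
classification_table(prob_ir, ir, np.arange(0.1, 1.0, 0.1))
```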

14 Using Classification Trees with One Predictor
Choose every possible HOMA value and find the sensitivity and specificity (as in creating a ROC curve). Assign relative weights to sensitivity and specificity, often equal, as previously. We need a cutpoint h so that we classify as:
HOMA > h → classify as IR
HOMA ≤ h → classify as non-IR

15 IR and HOMA: Trees with Other Cutpoints

 HOMA      Correct            Incorrect          Percentages
 Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False POS  False NEG
 3.5      588     1117        321    112         79.7       84.0         77.7        35.3        9.1
 4.0      542     1237        201    158         83.2       77.4         86.0        27.1       11.3
 4.5      506     1302        136    194         84.6       72.3         90.5        21.2       13.0
 5.0      458     1333        105    242         83.8       65.4         92.7        18.7       15.4
 5.5      400     1350         88    300         81.9       57.1         93.9        18.0       18.2
 6.0      338     1363         75    362         79.6       48.3         94.8        18.2       21.0
 6.5      286     1369         69    414         77.4       40.9         95.2        19.4       23.2
 7.0      233     1390         48    467         75.9       33.3         96.7        17.1       25.1
 7.5      174     1403         35    526         73.8       24.9         97.6        16.7       27.3
 8.0      123     1416         22    577         72.0       17.6         98.5        15.2       29.0
 8.5       60     1428         10    640         69.6        8.6         99.3        14.3       30.9
 9.0        0     1438          0    700         67.3        0.0        100.0          .        32.7

If equal weight is given to misclassifying IR and non-IR, then cutpoint = 4.61, with % correct = 85.2%, sensitivity = 71.7% and specificity = 91.7%.

16 IR and HOMA: Final, Simple Tree
[Tree diagram: the root node, 700/2138 (32.7% IR), splits at HOMA = 4.61; HOMA ≤ 4.61 gives 198/1517 (13.1% IR), and HOMA > 4.61 gives 502/621 (80.8% IR).]
This is exactly the result from the logistic regression, since the logistic function is monotone in HOMA. See next slide.
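A sketch of the corresponding one-split tree in scikit-learn, again using the hypothetical `homa` and `ir` arrays; a depth-1 tree scans every HOMA cutpoint and, with equal misclassification weights, should land near the 4.61 split shown above.

```python
from sklearn.tree import DecisionTreeClassifier

# One predictor, one split: the stump tries every HOMA cutpoint and keeps the best.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(homa.reshape(-1, 1), ir)

print("chosen HOMA cutpoint:", stump.tree_.threshold[0])
print("subjects per node (root, left, right):", stump.tree_.n_node_samples)
```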

17 IR and HOMA: Logistic Equivalent to Tree
The logistic cutpoint and the tree cutpoint (HOMA > 4.61) give the same classification table:

                  Assign IR    Assign non-IR
 Actual IR        N = 502      N = 198
 Actual non-IR    N = 119      N = 1319

18 Summary: Classifying IR from HOMA
One predictor: same classification with trees and logistic, because:
Logistic regression gives Prob(IR) = e^u / (1 + e^u), where u = -4.83 + 0.933(HOMA), so Prob(IR) large is equivalent to HOMA large. This is not the case with 2 predictors, as in the next slide.

19 [Figure: observed % IR by HOMA and BMI.] %IR increases with both HOMA and BMI. Logistic regression fits a surface to these percentages.

20 Classifying IR from HOMA and BMI
Two predictors: different classification with trees and logistic, because:
Logistic regression gives Prob(IR) = e^u / (1 + e^u), where u = -6.51 + 0.87(HOMA) + 0.07(BMI), so Prob(IR) large is equivalent to 0.87(HOMA) + 0.07(BMI) large. The values of HOMA and BMI satisfying this are shown in the next slide.

21 Classifying IR from HOMA and BMI: Logistic
[Figure: HOMA-BMI plane (HOMA marked at 3.60 and 4.65, BMI at 27.5 and 28.9) split into IR and non-IR regions by the line 0.87(HOMA) + 0.07(BMI) = cutpoint.]
Logistic regression forces a smooth partition such as this, although adding a HOMA-BMI interaction could give curvature to the demarcation line. Compare this to the tree partitioning on the next slide.
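A small sketch of this boundary using the coefficients quoted on the previous slide (the cutpoint value 0.37 below is a hypothetical choice, not one reported in the paper): classifying when Prob(IR) > c is the same as classifying when the linear predictor exceeds log(c/(1-c)), which traces a straight line in the HOMA-BMI plane.

```python
import numpy as np

# Coefficients from the two-predictor logistic fit on the previous slide.
b0, b1, b2 = -6.51, 0.87, 0.07

c = 0.37                                   # hypothetical probability cutpoint
u_cut = np.log(c / (1 - c))                # Prob(IR) > c  <=>  u > u_cut

def classify_ir(homa, bmi):
    """IR when the linear predictor b0 + b1*HOMA + b2*BMI exceeds the cut value."""
    return b0 + b1 * homa + b2 * bmi > u_cut

# The demarcation line: BMI as a function of HOMA along u = u_cut.
homa_grid = np.linspace(1.0, 10.0, 50)
bmi_boundary = (u_cut - b0 - b1 * homa_grid) / b2

print(classify_ir(5.0, 30.0))              # one (HOMA, BMI) pair as an example
```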

22 Classifying IR from HOMA and BMI: Trees
[Figure: the same HOMA-BMI plane partitioned by the tree into rectangular subgroups, labeled IR and non-IR.]
Trees partition HOMA-BMI combinations into subgroups, some of which are then combined as IR and non-IR. We now consider the steps and options that need to be specified in a tree analysis.
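For contrast, a sketch of a two-predictor tree in scikit-learn (reusing the hypothetical `rng`, `homa`, and `ir` from the earlier sketch; the `bmi` values are likewise made up): every split is a single cutpoint on one variable, so the partition of the HOMA-BMI plane is a set of axis-parallel rectangles rather than a tilted line.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

bmi = rng.normal(28.0, 4.0, size=len(homa))   # hypothetical BMI values
X = np.column_stack([homa, bmi])

tree = DecisionTreeClassifier(max_depth=2)    # small tree, for illustration only
tree.fit(X, ir)

# Every rule reads "HOMA <= cut" or "BMI <= cut": rectangles in the plane.
print(export_text(tree, feature_names=["HOMA", "BMI"]))
```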

23 Classification Tree Steps
There are several flavors of tree methods, each with many options, but most involve:
1. Specifying criteria for predictive accuracy.
2. Tree building.
3. Tree-building stopping rules.
4. Pruning.
5. Cross-validation.

24 Specifying criteria for predictive accuracy
Misclassification cost generalizes the concept of misclassification rates so that some types of misclassification are given greater weight. Relative weights, or costs, are assigned to each type of misclassification. A prior probability of each outcome is also specified, usually as the observed prevalence of the outcome in the data, but it could come from previous research or refer to other populations. The costs and priors together give the criteria for balancing specificity and sensitivity. Observed prevalence and equal weights → minimizing overall misclassification.
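A sketch of how costs and priors can be expressed in scikit-learn (an illustration of the idea, not the software used in the paper): unequal misclassification costs map onto class weights, and a prior different from the observed prevalence can be imposed the same way.

```python
from sklearn.tree import DecisionTreeClassifier

# Equal costs, observed prevalence: minimize the overall misclassification rate.
default_tree = DecisionTreeClassifier()

# Make misclassifying an IR subject (class 1) twice as costly as misclassifying
# a non-IR subject (class 0).
costed_tree = DecisionTreeClassifier(class_weight={1: 2.0, 0: 1.0})

# Impose equal priors rather than the observed ~33% IR prevalence:
# "balanced" weights each class inversely to its observed frequency.
prior_tree = DecisionTreeClassifier(class_weight="balanced")
```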

25 Tree Building
Recursively apply what we did for HOMA to each of the two resulting partitions, then to the next set, and so on. Every factor is screened at every step, and the same factor may be reused. Some algorithms also allow certain linear combinations of factors (e.g., as logistic regression provides, AKA discriminant functions) to be screened. An "impurity measure" or "splitting function" specifies the criterion for measuring how different two potential new subgroups are. Some choices are "Gini", chi-square, and G-square.
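A small sketch of one splitting function, the Gini impurity, and the quantity used to compare candidate splits: the drop in weighted impurity from the parent node to the two children (the `ir` and `homa` arrays in the example call are the hypothetical ones from earlier).

```python
import numpy as np

def gini(y):
    """Gini impurity of 0/1 outcomes: 1 minus the sum of squared class shares."""
    p = np.mean(y)
    return 1.0 - (p**2 + (1.0 - p)**2)

def split_gain(y, x, cut):
    """Reduction in weighted Gini impurity from splitting the node at x <= cut."""
    left, right = y[x <= cut], y[x > cut]
    n = len(y)
    child_impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(y) - child_impurity

# Example: impurity reduction from splitting the sample at HOMA = 4.61.
# print(split_gain(ir, homa, 4.61))
```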

26 Tree Building Stopping Rules
It is possible to continue splitting and building the tree until all subgroups are "pure", containing only one type of outcome. This may be too fine a partition to be useful. One alternative is "minimum N": splitting continues only until subgroups are pure or fall below a minimum size. Another choice is "fraction of objects": splitting stops once a subgroup contains a minimum fraction of an outcome class, or is pure.
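A sketch of these stopping rules in scikit-learn terms (the parameter values are arbitrary illustrations): `min_samples_leaf` accepts either a minimum count ("minimum N") or a minimum fraction of the sample ("fraction of objects").

```python
from sklearn.tree import DecisionTreeClassifier

# "Minimum N": never create a subgroup with fewer than 50 subjects.
tree_min_n = DecisionTreeClassifier(min_samples_leaf=50)

# "Fraction of objects": every subgroup must hold at least 5% of the sample.
tree_min_fraction = DecisionTreeClassifier(min_samples_leaf=0.05)

# A cruder alternative: simply cap the number of levels of splits.
tree_capped = DecisionTreeClassifier(max_depth=3)
```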

27 Tree Pruning
Pruning tries to solve the problem of poor generalizability due to over-fitting the results to the data at hand. Start at the latest splits and measure the magnitude of the reduction in misclassification due to each split; remove the split if the reduction is not large. How large is "not large"? This can be made at least objective, if not foolproof, with a complexity parameter related to the depth of the tree, i.e., the number of levels of splits. Combining that with the misclassification cost function gives "cost-complexity pruning", the method used in this paper.
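A sketch of cost-complexity pruning as exposed by scikit-learn (not the paper's software; `X` and `y` stand for hypothetical predictor and outcome arrays): the complexity parameter alpha penalizes the number of terminal nodes, and the pruning path lists the alphas at which splits would be removed.

```python
from sklearn.tree import DecisionTreeClassifier

# Grow a large tree, then compute its cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)    # X, y are hypothetical

# Larger alpha = heavier penalty on tree size = more splits pruned away.
pruned_trees = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    for alpha in path.ccp_alphas
]
print([t.get_n_leaves() for t in pruned_trees])
```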

28 Cross-Validation
At least two data sets are used. The decision rule is built with the training set(s) and applied to the test set(s). If the misclassification cost for the test sets is similar to that for the training sets, then the decision rule is considered "validated". With large datasets, as in business data mining, only one training and one test set are used. For smaller datasets, "v-fold cross-validation" is used: the data are randomly split into v sets; each set serves as the test set once (and as part of the training set v-1 times), with the combined remaining v-1 sets as the training set, for v analyses in all. The average cost is compared to that for the entire set.
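A sketch of v-fold cross-validation with scikit-learn, using v = 10 and the hypothetical `X` and `y` arrays from before: each fold serves once as the test set, and the averaged test performance is compared with the optimistic estimate from refitting on the whole sample.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=50)

# v = 10: the data are split into 10 sets; each set is the test set exactly once.
cv_correct = cross_val_score(tree, X, y, cv=10)        # % correct per fold
print("cross-validated % correct:", np.mean(cv_correct))

# Compare with the resubstitution estimate (same data for fitting and testing).
print("resubstitution % correct:", tree.fit(X, y).score(X, y))
```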

29 Classification Tree Software
CART from salford-systems.com.
Statistica from statsoft.com.
SAS: in the Enterprise Miner module.
SPSS has a module (Name ?).

30 Conclusions
Trees are better able to detect complex associations with the outcome.
Associations from trees may not be as generalizable.
There is not yet good probabilistic support for tree associations or their generalizability, analogous to power and p-values.
There is a logic with trees that is not apparent in many statistical methods.
There are many options with trees.
Trees are excellent exploratory tools.
Trees are data mining.

