
1 Ch9: Decision Trees 9.1 Introduction A decision tree:
(a) is a hierarchical data structure and (b) implements the divide-and-conquer strategy. A decision tree is composed of internal (decision) nodes and terminal (leaf) nodes. Each decision node implements a test function with discrete outcomes labeling the branches. Each leaf node carries a class label (for classification) or a numeric value (for regression).

2 A leaf node defines a localized region in which examples belong to the same class or have similar
values. The boundaries of the region are determined by the discriminants that are coded in the decision nodes on the path from the root to the leaf node. Advantages of decision trees: i) fast localization of the region covering the input; ii) interpretability.

3 9.2 Univariate Trees Decision nodes
Univariate: uses a single attribute $x_i$. Numeric $x_i$: binary split, $x_i > w_m$. Discrete $x_i$: $n$-way split over the $n$ possible values. Multivariate: uses all attributes $\boldsymbol{x}$. Leaf nodes: Classification: class labels, or class proportions. Regression: a numeric value, e.g., the average $r$ of the examples reaching the leaf, or a local fit.

4 Example: Numeric input, binary split
The leaf nodes define hyperrectangles in the input space.
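As a minimal sketch (the attributes, thresholds, and class labels below are hypothetical), such a tree is just nested univariate tests, and each root-to-leaf path selects one axis-aligned hyperrectangle:

```python
# A univariate tree on two numeric inputs: each decision node tests a single
# attribute against a threshold, so every leaf covers an axis-aligned
# hyperrectangle of the input space.

def classify(x1, x2):
    if x1 > 2.5:            # decision node: test one attribute only
        if x2 > 1.0:
            return "C1"     # region: x1 > 2.5 and x2 > 1.0
        return "C2"         # region: x1 > 2.5 and x2 <= 1.0
    return "C2"             # region: x1 <= 2.5

print(classify(3.0, 2.0))   # -> "C1"
```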

5 Given a training set, many trees can code it.
Objective: find the smallest one. Tree size is measured by: i) the number of nodes; ii) the complexity of the decision nodes. 9.2.1 Classification Trees: The goodness of a split is quantified by an impurity measure. A split with minimal impurity is desirable because the smallest tree is desired. A split is pure if, for all branches, all the examples choosing a branch belong to the same class.

6 For node m, $N_m$ examples reach m, $N_m^i$ of which belong to class $C_i$,
with $\sum_i N_m^i = N_m$. The probability of an example $x$ reaching node m belonging to class $C_i$ is estimated as $\hat{P}(C_i \mid x, m) \equiv p_m^i = N_m^i / N_m$. Node m is pure if the $p_m^i$ are either 0 or 1. It can then be a leaf node labeled with the class for which $p_m^i = 1$. One possible measure of impurity is entropy: $\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$. The smaller the entropy, the purer the node.
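As a small illustration (the function name and the example class counts are mine, not from the slides), the entropy impurity of a node can be computed directly from the class counts of the examples reaching it:

```python
import math

def node_entropy(class_counts):
    """Entropy impurity I_m of a node, given the counts N_m^i of examples of
    each class reaching it; p_m^i = N_m^i / N_m."""
    n_m = sum(class_counts)
    probs = [n / n_m for n in class_counts]
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(node_entropy([10, 10]))  # 1.0 bit: maximally impure two-class node
print(node_entropy([20, 0]))   # 0.0 bits: pure node, can become a leaf
```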

7 In information theory, let $p$ be the probability of occurrence of an event $e$. If $e$ occurs, we receive $I(e) = \log_2(1/p)$ bits of information. Bit: the amount of information received when one of two equally probable alternatives is observed, i.e., $\log_2(1/0.5) = 1$ bit. Since $\log_2(1/1) = 0$, this means that if we know for sure that an event will occur, its occurrence provides no information at all.

8 Consider a sequence of symbols $s_1, \dots, s_q$ output from a
source with occurring probabilities $p_1, \dots, p_q$. Zero-memory source: the probability of sending each symbol is independent of the symbols previously sent. The amount of information received from symbol $s_i$ is $I(s_i) = \log_2(1/p_i)$. Entropy of the information source: the average amount of information received per symbol, $H(S) = \sum_{i=1}^{q} p_i \log_2(1/p_i)$.
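A quick numerical sketch of these two quantities (the function names are mine):

```python
import math

def information(p):
    """Bits of information received when an event of probability p occurs."""
    return math.log2(1.0 / p)

def source_entropy(probs):
    """Average information per symbol of a zero-memory source."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(information(0.5))                   # 1.0 bit: one of two equally likely alternatives
print(information(1.0))                   # 0.0 bits: a certain event carries no information
print(source_entropy([0.25] * 4))         # 2.0 bits: four equiprobable symbols
print(source_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```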

9 Entropy measures the degree of disorder of a system.
The most disorderly system is the one whose symbols occur with equal probability. Proof: let $S$ have symbol probabilities $p_1, \dots, p_q$ and let $S'$ be a second source over the same symbols with probabilities $p'_1, \dots, p'_q$. The difference in entropy between the two sources can be written as $H(S') - H(S) = \sum_{i=1}^{q} p_i \log_2\frac{p_i}{p'_i} + \sum_{i=1}^{q} (p'_i - p_i)\log_2\frac{1}{p'_i}$ ... (1)

10 If $S'$ is a source with equiprobable symbols, $p'_i = 1/q$, then $H(S') = \sum_{i=1}^{q}\frac{1}{q}\log_2 q = \log_2 q$.

11 With $p'_i = 1/q$, the second term in (1) is zero, since $\sum_i (p'_i - p_i)\log_2 q = \log_2 q\,(1 - 1) = 0$. Hence
$\log_2 q - H(S) = \sum_{i=1}^{q} p_i \log_2\frac{p_i}{p'_i} = D(p \,\|\, p') \ge 0$, where $D(p \,\|\, p')$ is the information gain (relative entropy or Kullback-Leibler divergence between $p$ and $p'$). Therefore $H(S) \le \log_2 q$, with equality only for the equiprobable source.
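The inequality can be checked numerically; the 4-symbol distribution below is an arbitrary example of mine:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]     # an arbitrary 4-symbol source
u = [0.25] * 4                    # equiprobable source over the same symbols
print(entropy(p))                 # 1.75 bits
print(entropy(u))                 # 2.0 bits = log2(4), the maximum
print(kl_divergence(p, u))        # 0.25 bits = entropy(u) - entropy(p) >= 0
```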

12 In statistical mechanics
Systems deal with ensembles, each of which contains a large number of identical small systems, e.g., in thermodynamics or quantum mechanics. The property of an ensemble is determined by the average behavior of its constituent small systems. Example: a thermodynamic system that is initially at temperature $T_0$ is changed to temperature $T$.

13 At temperature T, sufficient time should be given in
order to allow the system to reach an equilibrium, in which the probability that a particle $i$ has an energy $\epsilon_i$ follows the Boltzmann distribution $p(\epsilon_i) = \frac{e^{-\epsilon_i / k_B T}}{Z}$, where $T$ is the temperature in kelvins, $k_B$ is the Boltzmann constant, and $Z$ is the partition function. The system energy at the equilibrium state is, in an average sense, $\langle E \rangle = \sum_i p(\epsilon_i)\,\epsilon_i$.

14 The partition function: $Z = \sum_i e^{-\epsilon_i / k_B T}$.
The coarseness of particle $i$ with energy $\epsilon_i$: $\log\frac{1}{p(\epsilon_i)}$. The entropy of the system is the average coarseness of its constituent particles: $H = \sum_i p(\epsilon_i)\,\log\frac{1}{p(\epsilon_i)}$.
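A small numerical sketch of these formulas (the energy levels and temperature below are made-up values, chosen only to exercise the equations):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def boltzmann_probs(energies, temperature):
    """p(e_i) = exp(-e_i / (k_B T)) / Z over the given energy levels."""
    weights = [math.exp(-e / (K_B * temperature)) for e in energies]
    z = sum(weights)                        # partition function Z
    return [w / z for w in weights]

def average_coarseness(probs):
    """Entropy of the system: the average of log2(1/p) over the particles, in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

energies = [0.0, 1e-21, 2e-21]              # hypothetical energy levels in joules
probs = boltzmann_probs(energies, temperature=300.0)
print(probs)
print(average_coarseness(probs))
```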

15 Consider the 2-class case and let $p \equiv p_m^1$, so that $p_m^2 = 1 - p$.
$\phi(p, 1-p)$ is an impurity measure function if it is increasing in $p$ on $[0, 1/2]$ and decreasing in $p$ on $[1/2, 1]$. Examples: 1. Entropy: $\phi(p, 1-p) = -p\log_2 p - (1-p)\log_2(1-p)$. 2. Gini index: $\phi(p, 1-p) = 2p(1-p)$. 3. Misclassification error: $\phi(p, 1-p) = 1 - \max(p, 1-p)$.
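The three measures can be compared directly; a small sketch (the function names are mine):

```python
import math

# Two-class impurity measures phi(p, 1-p): all peak at p = 0.5 and vanish at p = 0 or 1.

def entropy_impurity(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini_index(p):
    return 2 * p * (1 - p)

def misclassification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, entropy_impurity(p), gini_index(p), misclassification_error(p))
```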

16 CART algorithm: if node m is pure, generate a leaf and stop; otherwise split and continue recursively. Impurity after the split: $N_{mj}$ of the $N_m$ examples take branch $j$, and $N_{mj}^i$ of these belong to $C_i$, so $p_{mj}^i = N_{mj}^i / N_{mj}$ and $\mathcal{I}'_m = -\sum_{j=1}^{n}\frac{N_{mj}}{N_m}\sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$. For all attributes, calculate their split impurity and choose the one with the minimum impurity.
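A sketch of the attribute-selection step for discrete attributes, on a tiny made-up data set (the attribute names and rows are hypothetical):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def split_impurity(values, labels):
    """Weighted impurity I'_m after an n-way split on a discrete attribute."""
    branches = defaultdict(list)
    for v, y in zip(values, labels):
        branches[v].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in branches.values())

rows = [("sunny", "hot", "no"), ("sunny", "mild", "no"),
        ("rain", "mild", "yes"), ("rain", "cool", "yes")]
labels = [r[-1] for r in rows]
for j, name in enumerate(["outlook", "temp"]):
    print(name, split_impurity([r[j] for r in rows], labels))
# "outlook" gives impurity 0.0 and is chosen; "temp" gives 0.5.
```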

17

18 9.2.2 Regression Trees Difficulties with the CART Algorithm:
Splitting favors attributes with many values: many values → many branches → less impurity. Noise may lead to a very large tree if the purest tree is desired. 9.2.2 Regression Trees: The goodness of a split is measured by the mean square error from the estimated value. For node m, $X_m$ is the subset of $X$ reaching node m; let $g_m$ be the estimated value in node m.

19 The mean square error: $E_m = \frac{1}{N_m}\sum_t (r^t - g_m)^2\, b_m(x^t)$, where $b_m(x^t) = 1$ if $x^t$ reaches node m (and 0 otherwise) and $g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$ is the average of the required outputs reaching node m. After splitting: $E'_m = \frac{1}{N_m}\sum_j \sum_t (r^t - g_{mj})^2\, b_{mj}(x^t)$ with $g_{mj} = \frac{\sum_t b_{mj}(x^t)\, r^t}{\sum_t b_{mj}(x^t)}$.

20 If the error is acceptable, i.e., $E_m < \theta_r$, a leaf node is created and the value $g_m$ is stored. If the error is not acceptable, the examples reaching node m are split further such that the sum of the errors in the branches is minimized. Example: different error thresholds $\theta_r$ give trees of different sizes.
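A minimal sketch of this growth procedure for a single numeric input, assuming a hypothetical error threshold theta_r (the data and the tuple-based tree representation are mine):

```python
def mse(ys):
    """Mean square error of the outputs around their mean g_m; returns (error, g_m)."""
    g = sum(ys) / len(ys)
    return sum((y - g) ** 2 for y in ys) / len(ys), g

def grow(xs, ys, theta_r=0.05):
    error, g_m = mse(ys)
    if error < theta_r:
        return ("leaf", g_m)                       # acceptable error: store the average g_m
    candidates = sorted(set(xs))
    best = None
    for a, b in zip(candidates, candidates[1:]):   # thresholds between consecutive x values
        w = (a + b) / 2
        left = [(x, y) for x, y in zip(xs, ys) if x <= w]
        right = [(x, y) for x, y in zip(xs, ys) if x > w]
        err = (mse([y for _, y in left])[0] * len(left)
               + mse([y for _, y in right])[0] * len(right))
        if best is None or err < best[0]:
            best = (err, w, left, right)
    if best is None:                               # all x identical: cannot split further
        return ("leaf", g_m)
    _, w, left, right = best
    return ("node", w,
            grow([x for x, _ in left], [y for _, y in left], theta_r),
            grow([x for x, _ in right], [y for _, y in right], theta_r))

print(grow([1, 2, 3, 4], [0.1, 0.2, 1.0, 1.1]))
# -> ('node', 2.5, ('leaf', ~0.15), ('leaf', ~1.05))
```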

21

22 The CART algorithm can be modified to train a
regression tree by replacing (i) entropy with the mean square error and (ii) class labels with averages. Another possible error function is the worst possible error: $E_m = \max_j \max_t |r^t - g_{mj}|\, b_{mj}(x^t)$.
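A small sketch of this alternative error function (the helper name and the branch values are mine):

```python
def worst_possible_error(branches):
    """Largest absolute deviation from the branch average g_mj over all branches j.
    `branches` is a list of lists of required outputs r^t, one list per branch."""
    worst = 0.0
    for ys in branches:
        g_mj = sum(ys) / len(ys)
        worst = max(worst, max(abs(y - g_mj) for y in ys))
    return worst

print(worst_possible_error([[1.0, 3.0], [0.0, 0.5]]))  # 1.0
```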

23 9.3 Pruning Two types of pruning:
Prepruning: early stopping, e.g., when only a small number of examples reach a node. Postpruning: grow the whole tree, then prune unnecessary subtrees. Prepruning is faster; postpruning is more accurate.
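One common way to postprune (a sketch only, assuming a separate pruning set and the ('node', w, left, right) / ('leaf', value) regression trees from the earlier sketch): replace a subtree with a leaf whenever that does not increase the error on the pruning set.

```python
def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, w, left, right = tree
    return predict(left, x) if x <= w else predict(right, x)

def sq_error(tree, data):
    return sum((r - predict(tree, x)) ** 2 for x, r in data)

def postprune(tree, data):
    """Bottom-up: keep a subtree only if it beats a single leaf on the pruning set."""
    if tree[0] == "leaf" or not data:
        return tree
    _, w, left, right = tree
    left = postprune(left, [(x, r) for x, r in data if x <= w])
    right = postprune(right, [(x, r) for x, r in data if x > w])
    kept = ("node", w, left, right)
    leaf = ("leaf", sum(r for _, r in data) / len(data))   # candidate replacement leaf
    return leaf if sq_error(leaf, data) <= sq_error(kept, data) else kept
```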

24 9.4 Rule Extraction from Trees
Example: each path from the root to a leaf can be written as an IF-THEN rule whose condition is the conjunction of the tests along the path. Rule support: the percentage of training data covered by the rule.
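A sketch of the extraction step; the ('node', attribute, threshold, left, right) / ('leaf', label) tree below is a hypothetical example, not a tree from the slides:

```python
def extract_rules(tree, conditions=()):
    """Turn every root-to-leaf path into an IF-THEN rule (conjunction of tests)."""
    if tree[0] == "leaf":
        return [(" AND ".join(conditions) or "TRUE", tree[1])]
    _, attr, w, left, right = tree
    return (extract_rules(left, conditions + (f"{attr} <= {w}",))
            + extract_rules(right, conditions + (f"{attr} > {w}",)))

tree = ("node", "x1", 2.5,
        ("leaf", "C2"),
        ("node", "x2", 1.0, ("leaf", "C2"), ("leaf", "C1")))

for condition, label in extract_rules(tree):
    print(f"IF {condition} THEN class = {label}")
# The support of each rule would be the fraction of training examples
# that satisfy its condition.
```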

25 9.6 Multivariate Trees At a decision node m, all input dimensions can be used to split the node. When all inputs are numeric, a linear multivariate node uses the test $f_m(\boldsymbol{x}): \boldsymbol{w}_m^T \boldsymbol{x} + w_{m0} > 0$.

26 A quadratic multivariate node: $f_m(\boldsymbol{x}): \boldsymbol{x}^T \mathbf{W}_m \boldsymbol{x} + \boldsymbol{w}_m^T \boldsymbol{x} + w_{m0} > 0$.
Sphere node: $f_m(\boldsymbol{x}): \|\boldsymbol{x} - \boldsymbol{c}_m\| \le \alpha_m$, with center $\boldsymbol{c}_m$ and radius $\alpha_m$.
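A sketch of these three node tests on a plain numeric vector (the parameter values at the bottom are arbitrary examples):

```python
def linear_node(x, w, w0):
    """w^T x + w0 > 0"""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0

def quadratic_node(x, W, w, w0):
    """x^T W x + w^T x + w0 > 0"""
    quad = sum(x[i] * W[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))
    return quad + sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0

def sphere_node(x, center, alpha):
    """||x - c_m|| <= alpha_m"""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center)) ** 0.5 <= alpha

x = [1.0, 2.0]
print(linear_node(x, w=[0.5, -1.0], w0=0.25))                    # False: 0.5 - 2.0 + 0.25 < 0
print(quadratic_node(x, W=[[1, 0], [0, 1]], w=[0, 0], w0=-4.0))  # True: 1 + 4 - 4 > 0
print(sphere_node(x, center=[0.0, 0.0], alpha=2.0))              # False: ||x|| ≈ 2.24 > 2.0
```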

