# Decision Trees DefinitionDefinition MechanismMechanism Splitting FunctionsSplitting Functions Issues in Decision-Tree LearningIssues in Decision-Tree Learning.

## Presentation on theme: "Decision Trees DefinitionDefinition MechanismMechanism Splitting FunctionsSplitting Functions Issues in Decision-Tree LearningIssues in Decision-Tree Learning."— Presentation transcript:

Decision Trees DefinitionDefinition MechanismMechanism Splitting FunctionsSplitting Functions Issues in Decision-Tree LearningIssues in Decision-Tree Learning Avoiding overfitting through pruningAvoiding overfitting through pruning Numeric and Missing attributesNumeric and Missing attributes

Illustration Example: Learning to classify stars. Luminosity Mass Type A Type B Type C > r1 <= r1 > r2 <= r2

Definition A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute, and every terminal node corresponds to a class. A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute, and every terminal node corresponds to a class. There are two types of nodes: There are two types of nodes: Internal node.- Splits into different branches according to the different values the corresponding attribute can take. Example: luminosity r1. Internal node.- Splits into different branches according to the different values the corresponding attribute can take. Example: luminosity r1. Terminal Node.- Decides the class assigned to the example. Terminal Node.- Decides the class assigned to the example.

Classifying Examples Luminosity Mass Type A Type B Type C > r1 <= r1 > r2 <= r2 X = (Luminosity r2) Assigned Class

Appropriate Problems for Decision Trees  Attributes are both numeric and nominal.  Target function takes on a discrete number of values.  Data may have errors.  Some examples may have missing attribute values.

Decision Trees DefinitionDefinition MechanismMechanism Splitting FunctionsSplitting Functions Issues in Decision-Tree LearningIssues in Decision-Tree Learning Avoiding overfitting through pruningAvoiding overfitting through pruning Numeric and Missing attributesNumeric and Missing attributes

Historical Information Ross Quinlan – Induction of Decision Trees. Machine Learning Journal 1: 81-106, 1986 (over 8 thousand citations)

Historical Information Leo Breiman – CART (Classification and Regression Trees), 1984.

Mechanism There are different ways to construct trees from data. We will concentrate on the top-down, greedy search approach: Basic idea: 1. Choose the best attribute a* to place at the root of the tree. 1. Choose the best attribute a* to place at the root of the tree. 2. Separate training set D into subsets { D1, D2,.., Dk } where 2. Separate training set D into subsets { D1, D2,.., Dk } where each subset Di contains examples having the same value for a* each subset Di contains examples having the same value for a* 3. Recursively apply the algorithm on each new subset until 3. Recursively apply the algorithm on each new subset until examples have the same class or there are few of them. examples have the same class or there are few of them.

Illustration Attributes: size and humidity Size has two values: >r1 or r1 or <= r1 Humidity has three values: >r2, (>r3 and r2, (>r3 and <=r2), <= r3 size humidity r1 r2 r3 Class P: poisonous Class N: not-poisonous

Illustration humidity r1 r2 r3 Suppose we choose size as the best attribute: size P > r1 <= r1 Class P: poisonous Class N: not-poisonous Class N: not-poisonous ?

Illustration humidity r1 r2 r3 Suppose we choose humidity as the next best attribute: size P > r1 <= r1 humidity P NP NP >r2 <= r3 > r3 & r3 & <= r2

Formal Mechanism Create a root for the tree Create a root for the tree If all examples are of the same class or the number of examples If all examples are of the same class or the number of examples is below a threshold return that class is below a threshold return that class If no attributes available return majority class If no attributes available return majority class Let a* be the best attribute Let a* be the best attribute For each possible value v of a* For each possible value v of a* Add a branch below a* labeled “a = v” Add a branch below a* labeled “a = v” Let Sv be the subsets of example where attribute a*=v Let Sv be the subsets of example where attribute a*=v Recursively apply the algorithm to Sv Recursively apply the algorithm to Sv

What attribute is the best to split the data? Let us remember some definitions from information theory. A measure of uncertainty or entropy that is associated to a random variable X is defined as H(X) = - Σ pi log pi where the logarithm is in base 2. This is the “average amount of information or entropy of a finite complete probability scheme” (Introduction to I. Theory by Reza F.).

P(A) = 1/256, P(B) = 255/256 P(A) = 1/256, P(B) = 255/256 H(X) = 0.0369 bit H(X) = 0.0369 bit P(A) = 1/2, P(B) = 1/2 P(A) = 1/2, P(B) = 1/2 H(X) = 1 bit H(X) = 1 bit P(A) = 7/16, P(B) = 9/16 P(A) = 7/16, P(B) = 9/16 H(X) = 0.989 bit H(X) = 0.989 bit There are two possible complete events A and B (Example: flipping a biased coin).

Entropy is a function concave downward. 0 0.51 1 bit

Illustration Attributes: size and humidity Size has two values: >r1 or r1 or <= r1 Humidity has three values: >r2, (>r3 and r2, (>r3 and <=r2), <= r3 size humidity r1 r2 r3 Class P: poisonous Class N: not-poisonous

Splitting based on Entropy sizer1 r2 r3 humidity Size divides the sample in two. S1 = { 6P, 0NP} S2 = { 3P, 5NP} S1 S2 H(S1) = 0 H(S2) = -(3/8)log2(3/8) -(5/8)log2(5/8) -(5/8)log2(5/8)

Splitting based on Entropy sizer1 r2 r3 humidity humidity divides the sample in three. S1 = { 2P, 2NP} S2 = { 5P, 0NP} S3 = { 2P, 3NP} S1 S3 H(S1) = 1 H(S2) = 0 H(S3) = -(2/5)log2(2/5) -(3/5)log2(3/5) -(3/5)log2(3/5) S2

Information Gain IG(A) = H(S) - Σv (Sv/S) H (Sv) H(S) is the entropy of all examples H(Sv) is the entropy of one subsample after partitioning S based on all possible values of attribute A.

Components of IG(A) sizer1 r2 r3 humidity S1 S2 H(S1) = 0 H(S2) = -(3/8)log2(3/8) -(5/8)log2(5/8) -(5/8)log2(5/8) H(S) = -(9/14)log2(9/14) -(5/14)log2(5/14) -(5/14)log2(5/14) |S1|/|S| = 6/14 |S2|/|S| = 8/14

Components of IG(A) sizer1 r2 r3 humidity S1 S2 H(S1) = 0 H(S2) = -(3/8)log2(3/8) -(5/8)log2(5/8) -(5/8)log2(5/8) H(S) = -(9/14)log2(9/14) -(5/14)log2(5/14) -(5/14)log2(5/14) |S1|/|S| = 6/14 |S2|/|S| = 8/14

Gain Ratio Let’s define the entropy of the attribute: H(A) = - Σ pj log pj Where pj is the probability that attribute A takes value Vj. Then GainRatio(A) = IG(A) / H(A)

Gain Ratio sizer1 r2 r3 humidity S1 S2 H(size) = -(6/14)log2(6/14) -(8/14)log2(8/14) where |S1|/|S| = 6/14 |S2|/|S| = 8/14

Similar presentations